Kaggle Analytics Prediction Competition
April 30, 2020
Editorial "we" is used in place of "I", in the sense of "the author and the reader". I recommend reading sections I. Definition and III. Results before II. Implementation.
Our analysis aimed to identify common features and specific trends among "Cinderella" teams in NCAA® men's basketball. In this context, a Cinderella was defined as any basketball team seeded 10th or worse that advanced to Round 3 of the NCAA® tournament. We divided all the remaining teams into two further categories: Top and Ordinary.
We explored, filtered and analyzed NCAA® data across different dimensions and used descriptive statistics and exploratory visualizations to summarize the main characteristics of the data in general, and of "Cinderellaness", our target of interest, in particular.
Our analysis demonstrated that a typical Cinderella team is ranked between 20 and 65 in the pre-tournament rankings of popular ranking systems. Cinderellas are good at shooting 2-pointers in the regular season, but not so much in the tournaments. The opposite is true for three-point goals: Cinderellas have the highest three-point field goal ratio in NCAA® tournaments of all team categories. Cinderellas are successful at defensive rebounding and will likely have a positive Rebound Margin in regular season games. They typically win with a high scoring margin in Round 2 of the NCAA® tournament, but find it harder to keep it as high in the later rounds.
For this research, we trained an eXtreme Gradient Boosting (XGBoost) machine learning model to predict which team had the best potential to become a Cinderella before March Madness was canceled. Our model predicted that ETSU (East Tennessee State University) was the most likely candidate for the Cinderella team of the 2020 season.
Project Origin
Each season there are thousands of men's and women's NCAA® basketball games played between Division I teams, culminating in March Madness®, the national championship tournaments that start in the middle of March [1]. The men's and women's NCAA basketball tournaments are beloved American sports traditions. These are single-elimination tournaments, which means that the championship team has to win at least six games in a row to claim the title. This high-stakes environment, plus the chance to witness a crazy "Cinderella story" upset, gives the tournament its March Madness® nickname [4].
The challenge of the "Google Cloud & NCAA® March Madness Analytics" competition, sponsored by Google Cloud and hosted by Kaggle, is to present an exploratory analysis of the March Madness® using a Kaggle Notebook.
Prerequisite Knowledge
In this study, we assume that the reader is familiar with the basic NCAA® men's basketball rules and terminology. For those new to basketball, we recommend [4] and [26] for a quick introduction.
Input Data
The input NCAA® data is provided for this competition and is available from the competition website. The data covers college basketball games and teams and is divided into 6 sections - The Basics, Team Box Scores, Geography, Public Rankings, Play-by-play and Supplements. Please refer to the Data [1] section at the bottom of this notebook for a full description of each file. On March 12, 2020, the NCAA® canceled the Division I men's and women's 2020 basketball tournaments, as well as all remaining winter and spring NCAA® championships, based on the evolving COVID-19 public health threat [2], so the 2020 data is incomplete and does not contain information about the 2020 NCAA® tournament bracket.
The goal of our project is to use data analysis to explain "Cinderellaness" - to define common features and specific trends among "Cinderella" teams in NCAA® men's basketball.
The intended solution is to:
Data Exploration and Preprocessing
Scientific computing and analysis packages such as NumPy and Pandas will be used to explore and preprocess the data. Data cleaning will be performed where necessary. We will filter data across different categories, such as regular season vs. NCAA® tournament, all games vs. games won, team segment vs. metric vs. season.
The essential part of our analysis is to divide men's NCAA® basketball teams into 3 groups - Cinderella, Top and Ordinary.
A March Madness Cinderella is a team that greatly exceeds its NCAA® tournament expectations. They are generally afterthoughts on the Selection Sunday bracket, but wind up becoming one of the biggest stories of the tournament [3]. In the NCAA®, the field of teams is divided into four geographical regions. Each region has between 16 and 18 teams, which are assigned a seed number of one through 16, with the best team in the region awarded the No. 1 seed. Traditionally, the highest seeds (Nos. 1 through 8) have enjoyed more success than the lower seeds (Nos. 9 through 16). The lower seeds represent the potential Cinderellas of the tournament. A Cinderella team is one that unexpectedly achieves success in the tournament. Traditionally, Cinderella's chariot turns back into a pumpkin before getting to the Final Four [4] (also see Figure 1).
Considering the above definition, we decided to use the following segmentation as the foundation for our discussion and analysis:
display_img("06.png")
Here is the example segmentation result based on season 2019 data:
CINDERELLA TEAM OF 2018-19
The twelfth-seeded Oregon Ducks defeated Wisconsin 72-54 in a first-round game and beat UC Irvine 73-54 in a second-round game to advance to the Sweet 16, where they lost 49-53 to No. 1 seed Virginia, making this a classic example of a March Madness Cinderella story.
TOP TEAMS OF 2018-19
Notice how 16 teams were seeded Nos. 1 through 4, but only 14 are included in our TOP category, because 2 of the top-seeded teams (Kansas State Wildcats and Kansas Jayhawks) did not advance to Round 3.
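The segmentation rule described above can be sketched as a small helper. This is a hypothetical illustration, not the notebook's actual implementation; `categorize` and its arguments are made up for this example:

```python
# Hypothetical helper illustrating the three-way segmentation rule:
def categorize(seed_no, reached_round_3):
    """Assign a team to the Cinderella, Top, or Ordinary category."""
    if reached_round_3 and seed_no >= 10:
        return "Cinderella"
    if reached_round_3 and seed_no <= 4:
        return "Top"
    return "Ordinary"

# 2019 examples: 12th-seeded Oregon reached the Sweet 16,
# while 4th-seeded Kansas State did not:
print(categorize(12, True))   # Cinderella
print(categorize(4, False))   # Ordinary
```

Note that a seed of 5 through 9 always lands in the Ordinary category, no matter how far the team advances.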
Exploratory Visualizations
Data visualization libraries such as Matplotlib, Seaborn and Plotly will be used to create the exploratory visualizations. We will utilize different types of graphs, including but not limited to box plots, scatter plots and bar plots, to compare the features across different dimensions and see how the features are distributed. For the reader's convenience, static graphs are titled "Figure ..." and interactive graphs (responsive to mouse-over events) are titled "... Interactive graph".
Statistical Analysis
We will use descriptive statistics and measures of central tendency, such as the mean (the average) and the median (the middle value), to quantitatively describe and summarize our features of interest. Considering that Top category teams are expected to outperform the two other categories in most cases, we want to focus more on comparing Cinderella teams to Ordinary teams, for example analyzing a metric for Cinderellas vs. the median value of the same metric for Ordinary teams.
Machine Learning
We will use machine learning techniques to speculate which teams could have become Cinderellas in the 2020 season had the tournament not been canceled. Our intention is to try out different classifiers and choose whichever performs best. While machine learning is not the main focus of our study, we might also apply some model refinement techniques to meet a certain threshold. We will use the Scikit-learn, XGBoost and Imbalanced-learn modules to implement model training, evaluation and improvement.
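The kind of workflow we have in mind can be sketched on synthetic data. This is a minimal, self-contained sketch only: it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost, and the features, target, and weighting scheme are illustrative assumptions, not the notebook's actual features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))  # hypothetical per-team features (rank, FG ratios, ...)
# Imbalanced target: only a few "Cinderellas", as in the real data:
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
# Up-weight the rare class instead of resampling (imbalanced-learn
# offers SMOTE and similar resamplers for the same purpose):
w = np.where(y_tr == 1, (y_tr == 0).sum() / max((y_tr == 1).sum(), 1), 1.0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_tr, y_tr, sample_weight=w)
print("Held-out F1:", round(f1_score(y_te, clf.predict(X_te)), 2))
```

Swapping in `xgboost.XGBClassifier` keeps the same fit/predict interface, so the choice of booster can be deferred to model selection.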
🏀 CLICK HERE TO SKIP TO THE RESULTS 🏀
%matplotlib inline
# Import packages:
import numpy as np
import pandas as pd
pd.set_option('mode.chained_assignment', None)
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from plotly.subplots import make_subplots
import plotly.graph_objects as go
# Define default seaborn plot params:
sns.set(rc={'figure.figsize':(14,10)})
sns.set_palette("colorblind")
# Define default matplotlib plot params:
params = {'figure.figsize':(14,10),
'figure.titlesize':16,
'axes.titlesize':'x-large',
'axes.labelsize':'large',
'xtick.labelsize':'large',
'ytick.labelsize':'large',
'legend.fontsize':'large'}
pylab.rcParams.update(params)
# Define default plotly plot params:
plotly_width = 880
import sys
import warnings
warnings.filterwarnings("ignore")
file_nr = 1
def save_plot():
    '''Save the current plot into a ##.png file'''
    global file_nr
    if sys.executable != '/opt/conda/bin/python':  # running this notebook locally
        plt.savefig('kaggle/working/' + str(file_nr).zfill(2) + '.png',
                    bbox_inches='tight', pad_inches=1)
    else:
        plt.savefig(str(file_nr).zfill(2) + '.png',
                    bbox_inches='tight', pad_inches=0.5)
    print("File nr. {}".format(file_nr))
    file_nr += 1
men_dir = "/kaggle/input/march-madness-analytics-2020/MDataFiles_Stage2/"
if sys.executable != '/opt/conda/bin/python':
    # remove the forward slash if running this notebook locally:
    men_dir = men_dir[1:]

def load_file(df, name):
    '''Load a CSV file and show basic info'''
    print("File: {}".format(name))
    df = pd.read_csv(men_dir + name + '.csv')
    print("Num rows: {}".format(len(df)))
    print("NaN values: {}".format(df.isna().sum().sum()))
    print("Duplicated rows: {}".format(df.duplicated().sum()))
    print(pd.concat([df.head(3), df.tail(2)]))
    return df
Data Section 1 file: MRegularSeasonCompactResults.csv - this file identifies the game-by-game results for many seasons of historical data, starting with the 1985 season (the first year the NCAA® had a 64-team tournament) [1].
We will check each file that we load for data quality issues such as null values and duplicated rows.
MRegularSeasonCompactResults = None
MRegularSeasonCompactResults = load_file(MRegularSeasonCompactResults, "MRegularSeasonCompactResults")
Data Section 1 file: MNCAATourneyCompactResults.csv - this file identifies the game-by-game NCAA® tournament results for all seasons of historical data [1].
MNCAATourneyCompactResults = None
MNCAATourneyCompactResults = load_file(MNCAATourneyCompactResults, "MNCAATourneyCompactResults")
Calculate the scoring margin (the difference between the number of points scored by the winning team and by the losing team) for both dataframes:
for df in [MRegularSeasonCompactResults, MNCAATourneyCompactResults]:
    df['Scoring margin'] = df['WScore'] - df['LScore']
MRegularSeasonCompactResults.sample(3)
MNCAATourneyCompactResults.sample(3)
See how many games were won at each location:
MRegularSeasonCompactResults.WLoc.value_counts()
Total regular season games that were not played on a neutral court:
len(MRegularSeasonCompactResults[MRegularSeasonCompactResults.WLoc != "N"])
Create a plot. Note that we added all figure numbers later, once we knew their order in the Results section.
colors = [sns.color_palette("cubehelix", 10)[6], sns.color_palette("cubehelix", 10)[1], 'gold']
df = MRegularSeasonCompactResults[MRegularSeasonCompactResults.WLoc != "N"]
print(f'{df.Season.min()}-{df.Season.max()}')
sns.scatterplot(x="LScore", y="WScore", data=df,
hue="WLoc", palette=colors[:-1], edgecolor=None, s=50, alpha=0.35)
plt.xlabel("Points scored by the losing team")
plt.ylabel("Points scored by the winning team")
ax = plt.gca()
legend = ax.legend()
legend.texts[0].set_text("Location")
legend.texts[1].set_text("Home")
legend.texts[2].set_text("Visiting")
plt.title('Figure 7. Points scored vs. home or visiting winner team,\n 150K regular season games, 1985-2020.\n')
save_plot()
plt.show()
Total regular season games in our data:
len(MRegularSeasonCompactResults)
### Plot 1 ###
df = MRegularSeasonCompactResults
print(f'{df.Season.min()}-{df.Season.max()}')
sns.lineplot(x="Season", y="Scoring margin", data=df,
hue="WLoc", hue_order=['H', 'A', 'N'],
palette=colors)
plt.xlabel("Season")
ax = plt.gca()
legend = ax.legend()
legend.texts[0].set_text("Location")
legend.texts[1].set_text("Home")
legend.texts[2].set_text("Visiting")
legend.texts[3].set_text("Neutral")
plt.title('Figure 8. Scoring margin vs. winner team location (including neutral court games),\n 167K regular season games, 1985-2020.\n')
save_plot()
plt.show()
### Plot 2 ###
plt.figure(figsize=(10,8))
sns.scatterplot(x="Season", y="Scoring margin", data=MRegularSeasonCompactResults.sample(1000, random_state=0),
hue="WLoc", edgecolor='w', alpha=0.5, s=75, hue_order=['H', 'A', 'N'],
palette=colors)
plt.xlabel("Season")
ax = plt.gca()
legend = ax.legend()
legend.texts[0].set_text("Location")
legend.texts[1].set_text("Home")
legend.texts[2].set_text("Visiting")
legend.texts[3].set_text("Neutral")
plt.title('a closer look: random sample of 1000 games\n')
plt.show()
print("Descriptive statistics for file nr. {}:".format(str(file_nr-1)))
MRegularSeasonCompactResults[['WLoc', 'Scoring margin']].groupby('WLoc').describe()
The smaller plot ("a closer look") did not make it into the Results section, but it shows how games on a visiting court almost never had a scoring margin above 30 (for winning teams in this particular sample).
Create a game round column for NCAA® tournaments
Because of the consistent structure of the tournament schedule, we can actually tell what round a game was, depending on the exact DayNum [1]. Thus:
MNCAATourneyCompactResults['Round'] = MNCAATourneyCompactResults['DayNum'] # copy DayNum column
MNCAATourneyCompactResults['Round'].replace({134: "Play-in",
135: "Play-in",
136: "Round 1",
137: "Round 1",
138: "Round 2",
139: "Round 2",
143: "Sweet 16",
144: "Sweet 16",
145: "Elite 8",
146: "Elite 8",
152: "Final 4",
154: "National Final"}, inplace=True) # replace values with round names
# Also add numerical round values for easier sorting:
MNCAATourneyCompactResults['NumRound'] = MNCAATourneyCompactResults['DayNum'] # copy DayNum column
MNCAATourneyCompactResults['NumRound'].replace({134: 0,
135: 0,
136: 1,
137: 1,
138: 2,
139: 2,
143: 3,
144: 3,
145: 4,
146: 4,
152: 5,
154: 6}, inplace=True) # replace DayNum values with numerical round values
MNCAATourneyCompactResults.sample(3)
The men’s college basketball tournament is made up of 68 teams. On Selection Sunday, before any tournament game is played, those teams are ranked 1 through 68 by the Selection Committee, with the best team in college basketball — based on regular season and conference tournament performance — sitting at No. 1. Four of those teams are eliminated in the opening round of the tournament (known as the First Four), leaving us with a field of 64 for the first round. Those 64 teams are split into four regions of 16 teams each, with each team being ranked 1 through 16. That ranking is the team’s seed [23].
MNCAATourneySeeds = pd.read_csv(men_dir + 'MNCAATourneySeeds.csv')
MNCAATourneySeeds.sample(5)
MNCAATourneySeeds['SeedNo'] = MNCAATourneySeeds.Seed.str.extract(r'(\d+)').astype(np.int64)
MNCAATourneySeeds.sample(5)
Merge dataframes on season and winner team ID:
len(MNCAATourneyCompactResults)
MNCAATourneyCompactResults = pd.merge(MNCAATourneyCompactResults,
MNCAATourneySeeds,
how='inner',
left_on=['Season', 'WTeamID'],
right_on=['Season', 'TeamID'])
MNCAATourneyCompactResults = MNCAATourneyCompactResults.drop(columns='TeamID')
MNCAATourneyCompactResults
To select the Cinderella teams, we will look for any basketball team seeded 10th or worse that advanced to Round 3.
Seeded 10th or worse:
# Seeded 10th or worse:
possible_cinderellas = MNCAATourneyCompactResults[MNCAATourneyCompactResults['SeedNo'] >= 10]
possible_cinderellas
Advanced to the Round 3:
# Round 2 is DayNum=138 or 139 (Sat/Sun), to bring the tournament field from 32 teams to 16 teams (to SWEET 16):
cinderellas = possible_cinderellas[possible_cinderellas['DayNum'] >= 138].copy()  # played in Round 2
cinderellas["Cinderella"] = 1
cinderellas = cinderellas[['Season', 'WTeamID', 'Cinderella']].drop_duplicates()  # won in Round 2 (will play in Round 3)
cinderellas
Data Section 1 file: MTeams.csv - this file identifies the different college teams present in the dataset. Each school is uniquely identified by a 4 digit id number [1].
MTeams = None
MTeams = load_file(MTeams, "MTeams")
# Group by season and winner team id:
season_team_cinderellas = cinderellas.groupby(['Season','WTeamID'], as_index=False).mean()
season_team_cinderellas = season_team_cinderellas.sort_values(by='Season')
# Print out the resulting list of cinderella teams:
for index, row in season_team_cinderellas.iterrows():
    print("Season: {}; Team: {}".format(row['Season'],
          MTeams.loc[MTeams['TeamID'] == row['WTeamID'], 'TeamName'].values[0]))
plt.figure(figsize=(14,4))
df = season_team_cinderellas
print(f'{df.Season.min()}-{df.Season.max()}')
g = sns.countplot(x=season_team_cinderellas.Season, palette=sns.color_palette("colorblind")[1:2])
g.set_xticklabels(g.get_xticklabels(), rotation=45)
plt.ylabel("Cinderella teams")
plt.title("Figure 9. Cinderella team count per season,\n1985-2019.")
save_plot()
plt.show()
A closer look at one example (2019, team Oregon):
# See the 2019 example:
MNCAATourneyCompactResults[((MNCAATourneyCompactResults['WTeamID'] == 1332) | (MNCAATourneyCompactResults['LTeamID'] == 1332))
& (MNCAATourneyCompactResults['Season'] == 2019)]
From the above table: team 1332 (Oregon) won in Round 1 and Round 2, and lost in the Sweet 16 to team 1438 (Virginia).
Make a separate group for the top-seeded teams that advanced to Round 3.
This group represents the most competitive teams (high seed and high performance).
# Seeded 1, 2, 3 or 4:
top_seeded = MNCAATourneyCompactResults[MNCAATourneyCompactResults['SeedNo'] <= 4]
# Round 2 is DayNum=138 or 139 (Sat/Sun), to bring the tournament field from 32 teams to 16 teams (to SWEET 16):
top_seeded = top_seeded[top_seeded['DayNum'] >= 138].copy()
top_seeded["Top"] = 1
top_seeded = top_seeded[['Season', 'WTeamID', 'Top']].drop_duplicates()
top_seeded
A closer look at 2019 season:
# Group by season and team id in SEASON 2019 ONLY:
season_team_top_2019 = top_seeded[top_seeded["Season"] == 2019].groupby(['Season','WTeamID'], as_index=False).mean()
# Print out the resulting list of top teams in SEASON 2019 ONLY:
print("Season 2019 Top teams:\n")
for index, row in season_team_top_2019.iterrows():
    print("Season: {}; Team: {}".format(row['Season'],
          MTeams.loc[MTeams['TeamID'] == row['WTeamID'], 'TeamName'].values[0]))
Filter by season - we don't want to include seasons without any cinderella teams:
# Filter by season - we don't want to include seasons without any cinderella teams:
### Regular season ###
labeled_MRegularSeasonCompactResults = MRegularSeasonCompactResults[MRegularSeasonCompactResults['Season'].isin(season_team_cinderellas['Season'].tolist())]
### Tournaments ###
labeled_MNCAATourneyCompactResults = MNCAATourneyCompactResults[MNCAATourneyCompactResults['Season'].isin(season_team_cinderellas['Season'].tolist())]
labeled_MNCAATourneyCompactResults
Next, finish encoding labels. Merge initial dataframes (regular season and tournament data) with our lists of Cinderella and Top teams (on season and winner team ID):
### Regular season ###
print(len(labeled_MRegularSeasonCompactResults))
labeled_MRegularSeasonCompactResults = pd.merge(labeled_MRegularSeasonCompactResults,
cinderellas,
how='left',
on=['Season', 'WTeamID'])
labeled_MRegularSeasonCompactResults = pd.merge(labeled_MRegularSeasonCompactResults,
top_seeded,
how='left',
on=['Season', 'WTeamID'])
labeled_MRegularSeasonCompactResults
### Tournaments ###
print(len(labeled_MNCAATourneyCompactResults))
labeled_MNCAATourneyCompactResults = pd.merge(labeled_MNCAATourneyCompactResults,
cinderellas,
how='left',
on=['Season', 'WTeamID'])
labeled_MNCAATourneyCompactResults = pd.merge(labeled_MNCAATourneyCompactResults,
top_seeded,
how='left',
on=['Season', 'WTeamID'])
labeled_MNCAATourneyCompactResults
Create a categorical LABEL column
### Regular season ###
# Create a categorical LABEL column:
label = labeled_MRegularSeasonCompactResults[['Cinderella', 'Top']]
label = pd.DataFrame(label.idxmax(axis=1))
labeled_MRegularSeasonCompactResults['LABEL'] = label
# Fill in the missing values:
labeled_MRegularSeasonCompactResults['LABEL'] = labeled_MRegularSeasonCompactResults['LABEL'].fillna("Ordinary")
labeled_MRegularSeasonCompactResults
### Tournaments ###
# Create a categorical LABEL column:
label = labeled_MNCAATourneyCompactResults[['Cinderella', 'Top']]
label = pd.DataFrame(label.idxmax(axis=1))
labeled_MNCAATourneyCompactResults['LABEL'] = label
# Fill in the missing values:
labeled_MNCAATourneyCompactResults['LABEL'] = labeled_MNCAATourneyCompactResults['LABEL'].fillna("Ordinary")
# Sort value by round:
labeled_MNCAATourneyCompactResults = labeled_MNCAATourneyCompactResults.sort_values(by='NumRound', ascending=False) # 6, 5, 4...
labeled_MNCAATourneyCompactResults
Check the results of data segmentation:
### Regular season ###
labeled_MRegularSeasonCompactResults.LABEL.value_counts()
### Tournaments ###
labeled_MNCAATourneyCompactResults.LABEL.value_counts()
Fill in the missing values:
### Regular season ###
# Fill in the missing values:
labeled_MRegularSeasonCompactResults['Cinderella'] = labeled_MRegularSeasonCompactResults['Cinderella'].fillna(0) # not a cinderella
labeled_MRegularSeasonCompactResults['Top'] = labeled_MRegularSeasonCompactResults['Top'].fillna(0) # not a top
labeled_MRegularSeasonCompactResults
### Tournaments ###
# Fill in the missing values:
labeled_MNCAATourneyCompactResults['Cinderella'] = labeled_MNCAATourneyCompactResults['Cinderella'].fillna(0) # not a cinderella
labeled_MNCAATourneyCompactResults['Top'] = labeled_MNCAATourneyCompactResults['Top'].fillna(0) # not a top
labeled_MNCAATourneyCompactResults
Define label order and colors for future plots:
# Label order in all plots:
order=['Ordinary', 'Cinderella', 'Top']
# Label colors in all plots:
sns.palplot(sns.color_palette("colorblind", 3))
# Prepare a function that will help us compare Cinderella teams vs. Ordinary teams:
def cinderella_vs_ordinary(df, games, season, metric_name):
    '''Print a comparison of a Cinderella team metric
    vs. the Ordinary team median value of the same metric'''
    df_cinderella = df[df.Cinderella == 1.0]
    df_ordinary = df[df.LABEL == 'Ordinary']
    total_cinderella_games = len(df_cinderella)
    total_ordinary_games = len(df_ordinary)
    cinderella_mean = round(df_cinderella[metric_name].mean(), 2)
    ordinary_mean = round(df_ordinary[metric_name].mean(), 2)
    cinderella_median = round(df_cinderella[metric_name].median(), 2)
    ordinary_median = round(df_ordinary[metric_name].median(), 2)

    def print_share_message(share_str, share_ordinary_str, s='more'):
        '''Input string s: "more" or "less"'''
        print("\nIn {} of games {} in {}, Cinderella teams had {} than {} {} "
              "(mean: {}, median: {}) vs. {} of games "
              "for the Ordinary teams (mean: {}, median: {}).".format(
                  share_str, games, season, s,
                  ordinary_median, metric_name,
                  cinderella_mean, cinderella_median,
                  share_ordinary_str, ordinary_mean, ordinary_median))

    ### MORE THAN ORDINARY MEDIAN
    total_larger = len(df_cinderella[df_cinderella[metric_name] > ordinary_median])
    total_larger_ordinary = len(df_ordinary[df_ordinary[metric_name] > ordinary_median])
    share = total_larger / total_cinderella_games
    share_ordinary = total_larger_ordinary / total_ordinary_games
    if share > 0.51:
        print_share_message('{:.0%}'.format(share), '{:.0%}'.format(share_ordinary), "more")

    ### LESS THAN ORDINARY MEDIAN
    total_less = len(df_cinderella[df_cinderella[metric_name] < ordinary_median])
    total_less_ordinary = len(df_ordinary[df_ordinary[metric_name] < ordinary_median])
    share = total_less / total_cinderella_games
    share_ordinary = total_less_ordinary / total_ordinary_games
    if share > 0.52:
        print_share_message('{:.0%}'.format(share), '{:.0%}'.format(share_ordinary), "less")
Plot scoring margin distribution vs. winner team category:
# Define mean "triangle" marker for boxplots:
meanprops={"markerfacecolor":"white", "markeredgecolor":"white"}
df = labeled_MRegularSeasonCompactResults
print(f'{df.Season.min()}-{df.Season.max()}')
fig, ax = plt.subplots(2,1, figsize = (14, 8), sharex=True)
sns.boxplot(x='Scoring margin', y='LABEL', data=labeled_MRegularSeasonCompactResults, showmeans=True, ax=ax[0],
order=order,
orient='h',
meanprops=meanprops, showfliers = False, width=0.5)
sns.boxplot(x='Scoring margin', y='LABEL', data=labeled_MNCAATourneyCompactResults, showmeans=True, ax=ax[1],
order=order,
orient='h',
meanprops=meanprops, showfliers = False, width=0.5)
ax[0].set_title('Regular season')
ax[0].set_xlabel("")
ax[0].set_ylabel("")
ax[1].set_title('Tournaments')
ax[1].set_ylabel("")
plt.suptitle("Figure 10. Scoring margin distribution vs. winner team category,\n1985-2019.", y = 1.05)
save_plot()
plt.show()
print("Descriptive statistics for file nr. {}:".format(str(file_nr-1)))
print('\nRegular season')
print(labeled_MRegularSeasonCompactResults.groupby(['LABEL'])["Scoring margin"].describe())
print('\nTournaments')
print(labeled_MNCAATourneyCompactResults.groupby(['LABEL'])["Scoring margin"].describe())
cinderella_vs_ordinary(labeled_MRegularSeasonCompactResults, "won", "regular season", "Scoring margin")
# Labels for round ticks:
df = labeled_MNCAATourneyCompactResults.groupby(['NumRound', 'Round'], as_index=False).count()[['Round']]
list(df['Round'])
g = sns.lineplot(x="NumRound", y="Scoring margin", hue="LABEL", data=labeled_MNCAATourneyCompactResults,
hue_order=order, ci=None)
plt.title("Figure 11. Mean scoring margin vs. round and winner team category,\ntournaments, 1985-2019.\n")
ax = plt.gca()
handles, labels = ax.get_legend_handles_labels()
plt.legend(handles=handles[1:], labels=labels[1:])
plt.xlabel("Round")
g.set_xticklabels([0] + list(df['Round']))
save_plot()
plt.show()
print("Descriptive statistics for file nr. {}:".format(str(file_nr-1)))
labeled_MNCAATourneyCompactResults.groupby(['NumRound','LABEL'])["Scoring margin"].describe()
In order to reward better teams, first-round matchups are determined by pitting the top team in the region against the bottom team (No. 1 vs. No. 16). Then the next highest vs. the next lowest (No. 2 vs. No. 15), and so on. In theory, this means that the 1 seeds have the easiest opening matchup to win in the bracket [23].
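The pairing rule above is easy to express: within a 16-team region, first-round opponents' seeds always sum to 17. A tiny illustrative helper (hypothetical, not part of the notebook's pipeline):

```python
def first_round_opponent(seed):
    """Opponent seed for a given seed in a 16-team region (seeds sum to 17)."""
    if not 1 <= seed <= 16:
        raise ValueError("seed must be between 1 and 16")
    return 17 - seed

print(first_round_opponent(1))   # 16: the easiest opening matchup
print(first_round_opponent(10))  # 7: a typical Cinderella draw
```

This is why a No. 10-or-worse seed starts against a stronger opponent than any Top-category team does, which is part of what makes a Cinderella run unexpected.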
df = labeled_MNCAATourneyCompactResults[['SeedNo', 'Round', 'NumRound', 'WTeamID',
'LABEL']].groupby(['SeedNo', 'Round', 'NumRound',
'LABEL'], as_index=False).count()
df = df.sort_values(by='NumRound', ascending=False) # 6, 5, 4...
sns.swarmplot(x="SeedNo", y="Round", hue="LABEL", hue_order=order, data=df, size=15, palette="colorblind")
plt.xlim(0,17)
plt.xticks(np.arange(1, 17, step=1))
plt.xlabel("Seed nr.")
plt.legend(title=None)
plt.title("Figure 1. Data segmentation. Presence of team category in each round\nby seed number, 1985-2019 tournaments.\n")
save_plot()
sns.despine()
# Check font family (to use in fig.update_layout):
plt.rcParams['font.family']
# Count how many games each label won (per seed and round):
df = labeled_MNCAATourneyCompactResults[['SeedNo', 'Round', 'NumRound', 'WTeamID',
'LABEL']].groupby(['SeedNo', 'Round', 'NumRound',
'LABEL'], as_index=False).count()
df = df.sort_values(by='NumRound') # 1, 2, 3...
# Prepare hover text:
hover_text = []
for index, row in df.iterrows():
    hover_text.append(('Seed no.: {SeedNo}<br>' +
                       'Team category: {LABEL}<br>' +
                       'Total games won: {WTeamID}').format(SeedNo=row['SeedNo'],
                                                           LABEL=row['LABEL'],
                                                           WTeamID=row['WTeamID']))
df['text'] = hover_text
# Create figure
fig = go.Figure()
for i, label in enumerate(order):
    plot_df = df[df.LABEL == label]
    size = plot_df['WTeamID']
    fig.add_trace(go.Scatter(
        x=plot_df['SeedNo'], y=plot_df['Round'],
        mode='markers',
        text=plot_df['text'],
        name=label,
        marker_size=plot_df['WTeamID'],
        marker=dict(
            size=size,
            sizemode='area',
            sizeref=0.18,  # setting 'sizeref' to less than 1 increases marker sizes
            sizemin=2,
            line_width=3,
            line_color=sns.color_palette("colorblind").as_hex()[i]),  # outline color
        marker_color='rgba(0, 0, 0, 0)'  # transparent fill
    ))
# Move legend:
fig.update_layout(legend=dict(x=0.835, y=0.95, bgcolor='rgba(0, 0, 0, 0)'))
# Add titles:
fig.update_xaxes(title_text='Seed no.')
fig.update_yaxes(title_text='Round')
# Improve tick frequency:
fig.update_layout(xaxis = dict(tickmode = 'array', tickvals = list(range(1, 17))))
# Set size:
fig.update_layout(width=plotly_width, height=650)
# Set axis font:
fig.update_yaxes(tickfont=dict(size=14))
# Plot title:
fig.update_layout(
title={
'text': "Games won by team category and seed number,<br>1985-2019 tournaments. Interactive graph.",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
family='sans-serif',
color="#000"))
fig.show(renderer="kaggle")
fig_2 = go.Figure(fig) # to show the same fig in the Results section
print("Total games won per round:")
df.groupby(['NumRound','LABEL'])['WTeamID'].sum()
sns.swarmplot(x="NumOT", y='WTeamID', hue="LABEL", hue_order=order,
data=labeled_MNCAATourneyCompactResults,
alpha=0.75,
s=8)
plt.legend(title=None, bbox_to_anchor=(0.575, 1), loc=2)
plt.xlabel("Number of overtime periods in the game")
ax = plt.gca()
ax.get_yaxis().set_visible(False)
plt.title("Games won vs. number of overtime periods in the game,\ntournaments, 1985-2019.\n")
plt.show()
labeled_MNCAATourneyCompactResults.groupby(['NumOT', 'LABEL'])['WTeamID'].count()
Note. We decided not to include this figure in the Results section, because we found it not especially insightful in terms of "Cinderellaness".
# Locations for games played
Cities = None
Cities = load_file(Cities, "Cities")
Cities.State.nunique() # how many unique states do we have?
Data Section 3 file: MGameCities.csv - this file identifies all games, starting with the 2010 season, along with the city that the game was played in. Games from the regular season, the NCAA® tourney, and other post-season tournaments, are all listed together [1].
MGameCities = None
MGameCities = load_file(MGameCities, "MGameCities")
Merge both geography data files together:
new_MGameCities = pd.merge(MGameCities, Cities, on=['CityID'])
assert len(new_MGameCities) == len(MGameCities), "Wrong item count."
new_MGameCities
Load the file with each city's geo location [24]:
# Load the file with each city geo location:
geo_file = "/kaggle/input/ncaageocities/geo_Cities.csv"
if sys.executable != '/opt/conda/bin/python':
    # remove the forward slash if running this notebook locally:
    geo_file = geo_file[1:]
geo_Cities = pd.read_csv(geo_file)
geo_Cities.sample(5)
Join our new dataframe with the geo data:
cols = ['CityID', 'City', 'State']
geo_MGameCities = new_MGameCities.join(geo_Cities.set_index(cols), on=cols)
assert len(new_MGameCities) == len(geo_MGameCities), "Wrong item count."
geo_MGameCities
geo_MRegularSeasonCompactResults = pd.merge(geo_MGameCities[geo_MGameCities['CRType'] == 'Regular'],
MRegularSeasonCompactResults,
how='inner',
on=['Season', 'DayNum', 'WTeamID', 'LTeamID'],
validate="one_to_one")
geo_MRegularSeasonCompactResults
To find each team's home city, look at games the winning team played at home (game location "H"):
# Group by 'WTeamID', mean:
team_homes = geo_MRegularSeasonCompactResults[geo_MRegularSeasonCompactResults["WLoc"] == "H"][['WTeamID',
'CityID']].groupby(['WTeamID'], as_index = False).mean()
Test whether each team has a single home city. After applying the mean, CityID stays an integer only if all of a team's home wins happened in the same city:
# Test whether each team has a single home city:
np.array_equal(team_homes.CityID, team_homes.CityID.astype(int)) # output should be True
The check failed, so let's investigate which teams have more than one home city:
# Investigate which teams have more than one home city:
team_homes = geo_MRegularSeasonCompactResults[geo_MRegularSeasonCompactResults["WLoc"] == "H"][['WTeamID','CityID', 'Season']].groupby(['WTeamID','CityID'], as_index = False).mean()
team_homes[team_homes['WTeamID'].duplicated(keep=False)]
team_homes[team_homes['WTeamID'] == 1437]
The output shows that some teams have more than one home city. Let's look at one specific example to investigate further:
MTeams[MTeams.TeamID == 1437]
Cities[Cities.CityID.isin([4266, 4361, 4467])]
This example confirms that one team can have several home locations in the game-by-game data file. These three cities (Philadelphia, Villanova, Bryn Mawr) are close to each other, so this is not an error but rather a quirk of the data.
To avoid this kind of inconsistency, we will drop the duplicates. Note that we did not investigate which of the cities is bigger or more significant for each team; here we only care about keeping one city per team.
# Keep one home city per team (drop_duplicates keeps the first occurrence):
team_homes = team_homes.drop_duplicates('WTeamID')
team_homes[team_homes['WTeamID'] == 1437] # should be only one row
Cities[Cities.CityID == 4266]
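A minimal sketch of what drop_duplicates does here. The two CityIDs for team 1437 are the real ones seen above; the third row is purely illustrative. By default keep='first', so only the first row per WTeamID survives:

```python
import pandas as pd

# Team 1437 is listed with two candidate home cities; the second team
# (illustrative IDs) has only one, so it is untouched.
homes = pd.DataFrame({'WTeamID': [1437, 1437, 1277],
                      'CityID':  [4266, 4361, 9999]})
deduped = homes.drop_duplicates('WTeamID')
print(deduped)  # one row per team; CityID 4266 is kept for team 1437
```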
Drop the Season column (a mean of season years is meaningless) and rename the columns:
team_homes = team_homes.drop(columns='Season')
team_homes = team_homes.rename(columns={"WTeamID": "TeamID",
"CityID": "HomeCityID"}) # rename columns
team_homes
Add the geo location data:
team_homes = pd.merge(team_homes, geo_Cities, left_on='HomeCityID', right_on='CityID', how='left')
team_homes = team_homes.drop(columns='CityID')
team_homes
Count the number of teams per home city:
team_homes_cnt = team_homes.groupby(['HomeCityID', 'City', 'State', 'Latitude', 'Longitude'], as_index=False).count()
# Sort values for the bigger points to show above the small points:
team_homes_cnt = team_homes_cnt.sort_values(by='TeamID')
team_homes_cnt
Sanity check: team Michigan St should be from East Lansing:
team_homes[team_homes['TeamID'] == 1277]
team_homes_cnt.TeamID.value_counts()
# This module helps avoid overlapping text on scatter plots:
# Credit: https://github.com/Phlya/adjustText (The MIT License)
!pip install adjustText
# Parse SVG paths into matplotlib Path objects for plotting:
# Credit: https://github.com/nvictus/svgpath2mpl (The 3-Clause BSD License)
!pip install svgpath2mpl matplotlib
import os
os.environ['PROJ_LIB'] = 'C:\\Users\\Ivanna\\Anaconda3\\pkgs\\basemap-1.2.0-py37h4e5d7af_0\\Lib\\site-packages\\mpl_toolkits\\basemap\\data\\'
from mpl_toolkits.basemap import Basemap
from adjustText import adjust_text
from svgpath2mpl import parse_path
import matplotlib.patheffects as path_effects
from matplotlib import cm
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
df = geo_MRegularSeasonCompactResults
print(f'{df.Season.min()}-{df.Season.max()}')
# Create US map:
map = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49, fix_aspect=False)
map.drawmapboundary(fill_color='#cee2ee', linewidth=0)
map.fillcontinents(color='#fbf7f4')
map.drawcountries(linewidth=0.25)
map.drawcoastlines(linewidth=0.25)
map.drawstates(color='0.5', linewidth=0.25) # draw the American state border
# Create custom marker:
ball = parse_path("""M297,148.5C297,66.617,230.383,0,148.5,0S0,66.617,0,148.5S66.617,297,148.5,297S297,230.383,297,148.5z M211.044,156.5
h-54.877v124.252c-2,0.158-5.314,0.248-8,0.248c-2.687,0-5-0.09-8-0.248V156.5H85.956c-1.665,31.936-13.236,61.29-31.687,85.051
c-3.826-3.874-7.413-7.982-10.743-12.3c15.244-20.59,24.815-45.614,26.398-72.751H16.249c-0.159-2.648-0.249-5.314-0.249-8
s0.09-5.352,0.249-8h53.676c-1.582-27.137-11.154-52.162-26.397-72.751c3.329-4.318,6.917-8.427,10.742-12.3
C72.72,79.21,84.292,108.563,85.956,140.5h54.211V16.248c3-0.158,5.313-0.248,8-0.248c2.686,0,6,0.09,8,0.248V140.5h54.877
c1.664-31.937,13.236-61.29,31.687-85.051c3.825,3.873,7.413,7.981,10.742,12.3c-15.243,20.589-24.815,45.614-26.397,72.751h53.676
c0.159,2.648,0.249,5.314,0.249,8s-0.09,5.352-0.249,8h-53.676c1.583,27.137,11.154,52.161,26.398,72.751
c-3.33,4.317-6.917,8.426-10.743,12.3C224.28,217.79,212.709,188.436,211.044,156.5z""")
# Create a custom cmap based on a 'YlOrBr':
YlOrBr = cm.get_cmap('YlOrBr', 100)
newcmp = ListedColormap(YlOrBr(np.linspace(0.3, 1, 256)))
# Plot data on a map:
map.scatter(team_homes_cnt['Longitude'], # longitude goes first
team_homes_cnt['Latitude'], # latitude goes second
s=pow(team_homes_cnt['TeamID']*50, 1.5), # marker size
c=team_homes_cnt['TeamID'], # marker color
marker=ball,
alpha=0.8,
zorder=10,
cmap=newcmp)
# Annotate biggest points:
top = team_homes_cnt[team_homes_cnt['TeamID'] >= 3]
top_texts = [plt.text(top['Longitude'][i]+0.5,
top['Latitude'][i]-0.5,
top['City'][i],
zorder=11) for i in top.index]
# Add white outline to text:
for text in top_texts:
text.set_path_effects([path_effects.Stroke(linewidth=3, foreground='white', alpha=.8),
path_effects.Normal()])
# Fix overlapping text:
adjust_text(top_texts)
plt.title("Figure 2. Top US cities by NCAA® men's basketball team count,\nregular season, 2010-2020.\n")
save_plot()
plt.show()
team_homes_cnt[team_homes_cnt['TeamID'] > 1].tail(10)
Print out the team names for Philadelphia:
for team_id in team_homes.loc[team_homes['City'] == "Philadelphia", 'TeamID']:
    print(MTeams.loc[MTeams['TeamID'] == team_id, 'TeamName'].values[0])
MNCAATourneyDetailedResults = None
MNCAATourneyDetailedResults = load_file(MNCAATourneyDetailedResults, 'MNCAATourneyDetailedResults')
We will be using the below dictionary to look up the index of columns:
print({c: i for i, c in enumerate(MNCAATourneyDetailedResults.columns)})
Reshape the data so that each row holds the stats of a single team - either the winner or the loser of a game
# Columns about winning team:
winning = pd.concat([MNCAATourneyDetailedResults.iloc[:,:4], # game ID columns
MNCAATourneyDetailedResults.iloc[:,4:5], # LTeamID
MNCAATourneyDetailedResults.iloc[:,8:21], # WFGM, WFGA, WFGM3 ...
MNCAATourneyDetailedResults.iloc[:,27:29]], # opponent OR, DR
axis=1, sort=False)
winning['TeamID'] = winning['WTeamID']
winning['won'] = 1
winning # 'Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID'...
# Columns about losing team:
losing = pd.concat([MNCAATourneyDetailedResults.iloc[:,:3],
MNCAATourneyDetailedResults.iloc[:,5:6], # LScore
MNCAATourneyDetailedResults.iloc[:,4:5], # LTeamID
MNCAATourneyDetailedResults.iloc[:,21:34],
MNCAATourneyDetailedResults.iloc[:,14:16]], # opponent OR, DR
axis=1, sort=False)
losing['TeamID'] = losing['LTeamID']
losing['won'] = 0
losing # 'Season', 'DayNum', 'WTeamID', 'LScore', 'LTeamID'...
print(list(winning))
print(list(losing))
The resulting dataframe gets a "double_" prefix, because each game is now represented twice: one row for the winning team and one row for the losing team:
# Remove "W" and "L" prefixes:
new_columns = ['Season', 'DayNum', 'WTeamID', 'Score', 'LTeamID', # changed only "Score" here
'FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR', 'DR', 'Ast', 'TO', 'Stl', 'Blk', 'PF',
'OppOR', 'OppDR', 'TeamID', 'won']
# Rename columns:
winning.columns = new_columns
losing.columns = new_columns
# Concatenate:
frames = [winning, losing]
double_MNCAATourneyDetailedResults = pd.concat(frames)
assert(len(double_MNCAATourneyDetailedResults) == (len(winning) + len(losing)))
double_MNCAATourneyDetailedResults
Calculate Rebound Margin
Rebound Margin = RPG - OPP RPG [15]
If a team won, its opponent is the LTeamID; otherwise the opponent is the WTeamID. Following this logic we already created two columns with opponent rebounds - "OppOR" and "OppDR".
Total rebounds per game = offensive rebounds + defensive rebounds:
double_MNCAATourneyDetailedResults['Rebound Margin'] = (double_MNCAATourneyDetailedResults['OR'] +
double_MNCAATourneyDetailedResults['DR']) - \
(double_MNCAATourneyDetailedResults['OppOR'] +
double_MNCAATourneyDetailedResults['OppDR'])
double_MNCAATourneyDetailedResults.sample(3)
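A quick sanity check of the Rebound Margin formula on made-up single-game numbers:

```python
# Hypothetical box-score line: 10 offensive + 25 defensive rebounds for our
# team, 8 + 22 for the opponent.
OR, DR = 10, 25
OppOR, OppDR = 8, 22
rebound_margin = (OR + DR) - (OppOR + OppDR)
print(rebound_margin)  # 5: we out-rebounded the opponent by five boards
```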
Add categorical round information:
double_MNCAATourneyDetailedResults['Round'] = double_MNCAATourneyDetailedResults['DayNum'] # copy DayNum column
double_MNCAATourneyDetailedResults['Round'].replace({134: "Play-in",
135: "Play-in",
136: "Round 1",
137: "Round 1",
138: "Round 2",
139: "Round 2",
143: "Sweet 16",
144: "Sweet 16",
145: "Elite 8",
146: "Elite 8",
152: "Final 4",
154: "National Final"}, inplace=True) # replace values with round names
# Also add numerical round values for easier sorting:
double_MNCAATourneyDetailedResults['NumRound'] = double_MNCAATourneyDetailedResults['DayNum'] # copy DayNum column
double_MNCAATourneyDetailedResults['NumRound'].replace({134: 0,
135: 0,
136: 1,
137: 1,
138: 2,
139: 2,
143: 3,
144: 3,
145: 4,
146: 4,
152: 5,
154: 6}, inplace=True) # replace values with numerical round indexes
double_MNCAATourneyDetailedResults['Round'].value_counts()
FGA2 = FGA - FGA3
double_MNCAATourneyDetailedResults['FGA2'] = double_MNCAATourneyDetailedResults['FGA'] - double_MNCAATourneyDetailedResults['FGA3']
double_MNCAATourneyDetailedResults.sample(3)
FGM2 = FGM - FGM3
double_MNCAATourneyDetailedResults['FGM2'] = double_MNCAATourneyDetailedResults['FGM'] - double_MNCAATourneyDetailedResults['FGM3']
double_MNCAATourneyDetailedResults.sample(3)
# Filter by season - we don't want to include seasons without any cinderella teams:
labeled_double_MNCAATourneyDetailedResults = double_MNCAATourneyDetailedResults[double_MNCAATourneyDetailedResults['Season'].isin(season_team_cinderellas['Season'].tolist())]
cinderellas = cinderellas.rename(columns={"WTeamID": "TeamID"}) # rename columns
top_seeded = top_seeded.rename(columns={"WTeamID": "TeamID"}) # rename columns
cols = ['Season', 'TeamID']
labeled_double_MNCAATourneyDetailedResults = labeled_double_MNCAATourneyDetailedResults.join(cinderellas.set_index(cols), on=cols)
labeled_double_MNCAATourneyDetailedResults = labeled_double_MNCAATourneyDetailedResults.join(top_seeded.set_index(cols), on=cols)
labeled_double_MNCAATourneyDetailedResults
How many Cinderella team games do we have in this data:
labeled_double_MNCAATourneyDetailedResults.Cinderella.value_counts()
Continue adding labels:
# Create a categorical LABEL column:
label = labeled_double_MNCAATourneyDetailedResults[['Cinderella', 'Top']]
label = pd.DataFrame(label.idxmax(axis=1))
labeled_double_MNCAATourneyDetailedResults['LABEL'] = label
# Fill in the missing values:
labeled_double_MNCAATourneyDetailedResults['LABEL'] = labeled_double_MNCAATourneyDetailedResults['LABEL'].fillna("Ordinary")
# Fill in the missing values:
labeled_double_MNCAATourneyDetailedResults['Cinderella'] = labeled_double_MNCAATourneyDetailedResults['Cinderella'].fillna(0) # not a cinderella
labeled_double_MNCAATourneyDetailedResults['Top'] = labeled_double_MNCAATourneyDetailedResults['Top'].fillna(0) # not a top
labeled_double_MNCAATourneyDetailedResults
labeled_double_MNCAATourneyDetailedResults.Cinderella.value_counts()
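The idxmax-based labeling above can be sketched on a toy frame (values are illustrative; rows where both flags are missing get NaN and are filled with "Ordinary" afterwards, as in the code above):

```python
import numpy as np
import pandas as pd

flags = pd.DataFrame({'Cinderella': [1.0, np.nan],
                      'Top':        [np.nan, 1.0]})
# idxmax(axis=1) returns, per row, the name of the column holding the
# maximum value; NaNs are skipped, so the 1.0 flag wins in each row.
labels = flags.idxmax(axis=1)
print(labels.tolist())  # ['Cinderella', 'Top']
```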
Data Section 2 file: MRegularSeasonDetailedResults.csv - this file provides team-level box scores for many regular seasons of historical data, starting with the 2003 season [1].
MRegularSeasonDetailedResults = None
MRegularSeasonDetailedResults = load_file(MRegularSeasonDetailedResults, 'MRegularSeasonDetailedResults')
Repeat the same steps for the regular season detailed results
# Columns about winning team:
winning = pd.concat([MRegularSeasonDetailedResults.iloc[:,:4],
MRegularSeasonDetailedResults.iloc[:,4:5], # LTeamID
MRegularSeasonDetailedResults.iloc[:,8:21],
MRegularSeasonDetailedResults.iloc[:,27:29]], # opponent OR, DR
axis=1, sort=False)
winning['TeamID'] = winning['WTeamID']
winning['won'] = 1
# Columns about losing team:
losing = pd.concat([MRegularSeasonDetailedResults.iloc[:,:3],
MRegularSeasonDetailedResults.iloc[:,5:6], # LScore
MRegularSeasonDetailedResults.iloc[:,4:5], # LTeamID
MRegularSeasonDetailedResults.iloc[:,21:34],
MRegularSeasonDetailedResults.iloc[:,14:16]], # opponent OR, DR
axis=1, sort=False)
losing['TeamID'] = losing['LTeamID']
losing['won'] = 0
# Rename columns:
winning.columns = new_columns
losing.columns = new_columns
# Concatenate:
frames = [winning, losing]
double_MRegularSeasonDetailedResults = pd.concat(frames)
print(len(double_MRegularSeasonDetailedResults))
double_MRegularSeasonDetailedResults['Round'] = "Regular Season"
double_MRegularSeasonDetailedResults['FGA2'] = double_MRegularSeasonDetailedResults['FGA'] - double_MRegularSeasonDetailedResults['FGA3']
double_MRegularSeasonDetailedResults['FGM2'] = double_MRegularSeasonDetailedResults['FGM'] - double_MRegularSeasonDetailedResults['FGM3']
double_MRegularSeasonDetailedResults
# Filter by season - we don't want to include seasons without any cinderella teams:
labeled_double_MRegularSeasonDetailedResults = double_MRegularSeasonDetailedResults[double_MRegularSeasonDetailedResults['Season'].isin(season_team_cinderellas['Season'].tolist())]
cols = ['Season', 'TeamID']
labeled_double_MRegularSeasonDetailedResults = labeled_double_MRegularSeasonDetailedResults.join(cinderellas.set_index(cols), on=cols)
labeled_double_MRegularSeasonDetailedResults = labeled_double_MRegularSeasonDetailedResults.join(top_seeded.set_index(cols), on=cols)
# Create a categorical LABEL column:
label = labeled_double_MRegularSeasonDetailedResults[['Cinderella', 'Top']]
label = pd.DataFrame(label.idxmax(axis=1))
labeled_double_MRegularSeasonDetailedResults['LABEL'] = label
# Fill in the missing values:
labeled_double_MRegularSeasonDetailedResults['LABEL'] = labeled_double_MRegularSeasonDetailedResults['LABEL'].fillna("Ordinary")
# Fill in the missing values:
labeled_double_MRegularSeasonDetailedResults['Cinderella'] = labeled_double_MRegularSeasonDetailedResults['Cinderella'].fillna(0) # not a cinderella
labeled_double_MRegularSeasonDetailedResults['Top'] = labeled_double_MRegularSeasonDetailedResults['Top'].fillna(0) # not a top
# Calculate Rebound Margin:
labeled_double_MRegularSeasonDetailedResults['Rebound Margin'] = (labeled_double_MRegularSeasonDetailedResults['OR'] +
labeled_double_MRegularSeasonDetailedResults['DR']) - \
(labeled_double_MRegularSeasonDetailedResults['OppOR'] +
labeled_double_MRegularSeasonDetailedResults['OppDR'])
labeled_double_MRegularSeasonDetailedResults
Check the resulting labels:
labeled_double_MRegularSeasonDetailedResults.LABEL.value_counts()
Copy stats of winning teams to a separate dataframe (for both regular season and tournaments)
### Regular season ###
reg_winning_stats = labeled_double_MRegularSeasonDetailedResults[labeled_double_MRegularSeasonDetailedResults['won'] == 1]
reg_winning_stats.sample(3)
### Tournaments ###
tourney_winning_stats = labeled_double_MNCAATourneyDetailedResults[labeled_double_MNCAATourneyDetailedResults['won'] == 1]
tourney_winning_stats.sample(3)
Make two lists with dataframes that will be used in plots:
# Make two lists with dataframes that will be used in plots:
detailed_results_dfs = [labeled_double_MRegularSeasonDetailedResults,
labeled_double_MNCAATourneyDetailedResults] # all games
winning_dfs = [reg_winning_stats, tourney_winning_stats] # games won
def print_distribution_comments(metric_name):
'''Print Cinderella vs. Ordinary stats for different seasons and games'''
cinderella_vs_ordinary(detailed_results_dfs[0], "played", "regular season", metric_name)
cinderella_vs_ordinary(detailed_results_dfs[1], "played", "tournaments", metric_name)
cinderella_vs_ordinary(winning_dfs[0], "won", "regular season", metric_name)
cinderella_vs_ordinary(winning_dfs[1], "won", "tournaments", metric_name)
fig = make_subplots(rows=2, cols=1,
shared_xaxes=True, vertical_spacing = 0.15)
two_point_x = [] # default data, all X values
three_point_x = [] # data to use on button click, all X values
row = 1 # row nr. for subplot
for df, won_df in zip(detailed_results_dfs, winning_dfs): # Make plots for both regular season and tournaments:
i = 0
for label in order: # 'Ordinary', 'Cinderella', 'Top'
plot_df = df[df.LABEL == label] # all games
won_plot_df = won_df[won_df.LABEL == label] # games won
# All games (visible):
fig.add_trace(
go.Box(x=plot_df['FGM2'],
name=label,
marker_color=sns.color_palette("colorblind").as_hex()[i],
boxmean=True, # represent mean
boxpoints='suspectedoutliers', # only suspected outliers
visible=True),
row=row, col=1)
two_point_x.append(plot_df['FGM2'])
three_point_x.append(plot_df['FGM3'])
# Games won (not visible by default):
fig.add_trace(
go.Box(x=won_plot_df['FGM2'],
name=label,
marker_color=sns.color_palette("colorblind").as_hex()[i],
boxmean=True, # represent mean
boxpoints='suspectedoutliers', # only suspected outliers
visible=False),
row=row, col=1)
two_point_x.append(won_plot_df['FGM2'])
three_point_x.append(won_plot_df['FGM3'])
i+=1
row+=1 # go to next subplot
# Default visibility:
show_all_games = [True, False, True, False, True, False, # row 1: OO CC TT ('Ordinary', 'Cinderella', 'Top')
True, False, True, False, True, False] # row 2: OO CC TT ('Ordinary', 'Cinderella', 'Top')
# Opposite visibility (reverse above list):
show_games_won = show_all_games[::-1]
fig.update_layout(showlegend=False, # hide legend
width=plotly_width, height=750) # set size
# Set axis font:
fig.update_yaxes(tickfont=dict(size=14))
default_title = "Two-point field goals made (distribution) vs. team category,<br>2003-2019. Interactive graph."
hidden_title = "Three-point field goals made (distribution) vs. team category,<br>2003-2019. Interactive graph."
default_xtitle = dict(x=0.5, y=-0.1, xref="paper", yref="paper",
text="Two-point field goals per game",
showarrow=False, font=dict(size=14))
hidden_xtitle = dict(x=0.5, y=-0.1, xref="paper", yref="paper",
text="Three-point field goals per game",
showarrow=False, font=dict(size=14))
upper_subplot_title = dict(x=0.5, y=1.05, xref="paper", yref="paper",
text="Regular season", showarrow=False, font=dict(size=16))
lower_subplot_title = dict(x=0.5, y=0.45, xref="paper", yref="paper",
text="Tournaments", showarrow=False, font=dict(size=16))
# Add subplot titles:
fig.add_annotation(upper_subplot_title)
fig.add_annotation(lower_subplot_title)
# Add annotations:
fig.add_annotation(default_xtitle)
default_annotations = [default_xtitle, upper_subplot_title, lower_subplot_title]
hidden_annotations = [hidden_xtitle, upper_subplot_title, lower_subplot_title]
# Add buttons:
fig.update_layout(
updatemenus=[
dict( # these buttons will change data
type="buttons",
direction="right",
active=0,
x=0.45,
y=1.2,
buttons=list([
dict(label="2-point goals",
method="update",args=[{"x": two_point_x},
{"title": default_title,
"annotations": default_annotations},
{"visible": show_all_games}]),
dict(label="3-point goals",
method="update",args=[{"x": three_point_x},
{"title": hidden_title,
"annotations": hidden_annotations},
{"visible": show_all_games}])
]),
),
dict( # these buttons will change visibility of "games won"
buttons=list([
dict(label="All games",
method="restyle",args=[{"visible": show_all_games}]),
dict(label="Games won",
method="restyle",args=[{"visible": show_games_won}])
]),
direction="down",
showactive=True,
x=0.8,
y=1.2,
)
])
# Plot title:
fig.update_layout(
title={
'text': default_title,
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
family='sans-serif',
color="#000"),
margin=dict(t=180) # margin between title and plot
)
fig.show(renderer="kaggle")
fig_3 = go.Figure(fig) # to show the same fig in the Results section
print_distribution_comments("FGM2")
print_distribution_comments("FGM3")
label_colors = sns.color_palette("colorblind").as_hex()[0:3]
fig = make_subplots(rows=2, cols=1,
shared_xaxes=True, vertical_spacing = 0.15)
two_point_x = [] # default data, all X values
three_point_x = [] # data to use on button click, all X values
two_point_text = [] # default text on the bar
three_point_text = [] # text on the bar on button click
row = 1 # row nr. for subplot
for df, won_df in zip(detailed_results_dfs, winning_dfs): # Make plots for both regular season and tournaments:
i = 0 # counter for labels
for label in order: # 'Ordinary', 'Cinderella', 'Top'
plot_df = df[df.LABEL == label] # all games
won_plot_df = won_df[won_df.LABEL == label] # games won
# All games (visible) background layer:
background_x = plot_df['FGA2'].mean()
front_x = plot_df['FGM2'].mean()
fig.add_trace(
go.Bar(x=[background_x], # just one value for a bar
y=[label],
name=label,
marker_color=label_colors[i],
visible=True,
opacity=0.5,
orientation='h'),
row=row, col=1)
two_point_x.append([background_x])
three_point_x.append([plot_df['FGA3'].mean()])
two_point_text.append("") # empty because there is no text in background bar
three_point_text.append("")
# All games (visible) front layer:
fig.add_trace(
go.Bar(x=[front_x], # just one value for a bar
y=[label],
name=label,
marker_color=label_colors[i],
visible=True,
orientation='h',
text=(front_x/background_x), # calculate the ratio
textposition='auto',
texttemplate='%{text:.1%}'), # format output
row=row, col=1)
two_point_x.append([front_x])
three_point_x.append([plot_df['FGM3'].mean()])
two_point_text.append(front_x/background_x)
three_point_text.append(plot_df['FGM3'].mean()/plot_df['FGA3'].mean())
# Games won (not visible by default) background layer:
background_x = won_plot_df['FGA2'].mean()
front_x = won_plot_df['FGM2'].mean()
fig.add_trace(
go.Bar(x=[background_x], # just one value for a bar
y=[label],
name=label,
marker_color=label_colors[i],
visible=False,
opacity=0.5,
orientation='h'),
row=row, col=1)
two_point_x.append([background_x])
three_point_x.append([won_plot_df['FGA3'].mean()]) # not visible
two_point_text.append("") # empty because there is no text in background bar
three_point_text.append("")
# Games won (not visible by default) front layer:
fig.add_trace(
go.Bar(x=[front_x], # just one value for a bar
y=[label],
name=label,
marker_color=label_colors[i],
visible=False,
orientation='h',
text=(front_x/background_x), # calculate the ratio
textposition='auto',
texttemplate='%{text:.1%}'), # format output
row=row, col=1)
two_point_x.append([front_x])
three_point_x.append([won_plot_df['FGM3'].mean()])
two_point_text.append(front_x/background_x)
three_point_text.append(won_plot_df['FGM3'].mean()/won_plot_df['FGA3'].mean())
i+=1
row+=1 # go to next subplot
fig.update_layout(barmode='overlay') # the bars are plotted over one another
# Default visibility:
show_all_games = [True, True, False, False, # 'Ordinary'
True, True, False, False, # 'Cinderella'
True, True, False, False, # 'Top'
True, True, False, False, # 'Ordinary'
True, True, False, False, # 'Cinderella'
True, True, False, False] # 'Top'
# Opposite visibility (reverse above list):
show_games_won = show_all_games[::-1]
fig.update_layout(showlegend=False, # hide legend
width=plotly_width, height=550) # set size
# Set axis font:
fig.update_yaxes(tickfont=dict(size=14))
default_title = "Mean 2-point field goal ratio vs. team category,<br>2003-2019. Interactive graph."
hidden_title = "Mean 3-point field goal ratio vs. team category,<br>2003-2019. Interactive graph."
default_xtitle = dict(x=0.5, y=-0.15, xref="paper", yref="paper",
text="Mean 2-point goals per game (scored / attempted)",
showarrow=False, font=dict(size=14))
hidden_xtitle = dict(x=0.5, y=-0.15, xref="paper", yref="paper",
text="Mean 3-point goals per game (scored / attempted)",
showarrow=False, font=dict(size=14))
upper_subplot_title = dict(x=0.5, y=1.075, xref="paper", yref="paper",
text="Regular season", showarrow=False, font=dict(size=16))
lower_subplot_title = dict(x=0.5, y=0.475, xref="paper", yref="paper",
text="Tournaments", showarrow=False, font=dict(size=16))
# Add subplot titles:
fig.add_annotation(upper_subplot_title)
fig.add_annotation(lower_subplot_title)
# Add annotations:
fig.add_annotation(default_xtitle)
default_annotations = [default_xtitle, upper_subplot_title, lower_subplot_title]
hidden_annotations = [hidden_xtitle, upper_subplot_title, lower_subplot_title]
# Add buttons:
fig.update_layout(
updatemenus=[
dict( # these buttons will change data
type="buttons",
direction="right",
active=0,
x=0.45,
y=1.2,
buttons=list([
dict(label="2-point goals",
method="update",args=[{"x": two_point_x, "text": two_point_text},
{"title": default_title,
"annotations": default_annotations},
{"visible": show_all_games}]),
dict(label="3-point goals",
method="update",args=[{"x": three_point_x, "text": three_point_text},
{"title": hidden_title,
"annotations": hidden_annotations},
{"visible": show_all_games}])
]),
),
dict( # these buttons will change visibility of "games won"
buttons=list([
dict(label="All games",
method="restyle",args=[{"visible": show_all_games}]),
dict(label="Games won",
method="restyle",args=[{"visible": show_games_won}])
]),
direction="down",
showactive=True,
x=0.8,
y=1.2,
)
])
# Plot title:
fig.update_layout(
title={
'text': default_title,
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
family='sans-serif',
color="#000"),
margin=dict(t=150) # margin between title and plot
)
fig.show(renderer="kaggle")
fig_5 = go.Figure(fig) # to show the same fig in the Results section
print_distribution_comments("FGA2")
print_distribution_comments("FGA3")
label_colors = sns.color_palette("colorblind").as_hex()[0:3]
fig = make_subplots(rows=2, cols=1, subplot_titles=("Regular season", "Tournaments"),
shared_xaxes=True, vertical_spacing = 0.15)
row = 1 # row nr. for subplot
for df, won_df in zip(detailed_results_dfs, winning_dfs): # Make plots for both regular season and tournaments:
i = 0 # counter for labels
for label in order: # 'Ordinary', 'Cinderella', 'Top'
plot_df = df[df.LABEL == label] # all games
won_plot_df = won_df[won_df.LABEL == label] # games won
# All games - visible:
background_x = plot_df['Ast'].mean()
front_x = plot_df['TO'].mean()
def plot_bar(background_x, front_x, visible):
# Upper bar:
fig.add_trace(
go.Bar(x=[background_x], # just one value for a bar
y=[label],
name=label,
marker_color=label_colors[i],
visible=visible,
opacity=1,
orientation='h', width=0.35, offset=-0.05,
text=(background_x/front_x), # calculate the ratio
textposition='outside',
texttemplate='%{text:.2f}'), # format output
row=row, col=1)
# Lower bar:
fig.add_trace(
go.Bar(x=[front_x], # just one value for a bar
y=[label],
name=label,
marker_color=label_colors[i],
visible=visible,
opacity=0.35,
orientation='h', width=0.35, offset=-0.40),
row=row, col=1)
plot_bar(background_x, front_x, True)
# Games won - not visible by default:
background_x = won_plot_df['Ast'].mean()
front_x = won_plot_df['TO'].mean()
plot_bar(background_x, front_x, False)
i+=1
row+=1 # go to next subplot
# Default visibility:
show_all_games = [True, True, False, False, # 'Ordinary'
True, True, False, False, # 'Cinderella'
True, True, False, False, # 'Top'
True, True, False, False, # 'Ordinary'
True, True, False, False, # 'Cinderella'
True, True, False, False] # 'Top'
# Opposite visibility (reverse above list):
show_games_won = show_all_games[::-1]
fig.update_layout(showlegend=False, # hide legend
width=plotly_width, height=650) # set size
# Set axis font:
fig.update_yaxes(tickfont=dict(size=14))
# Add titles:
fig.update_xaxes(title_text='Mean assists / turnovers', row=2, col=1)
# Add buttons:
fig.update_layout(
updatemenus=[
dict(
type="buttons",
direction="right",
active=0,
x=0.6,
y=1.2,
buttons=list([
dict(label="All games",
method="restyle",args=[{"visible": show_all_games}]),
dict(label="Games won",
method="restyle",args=[{"visible": show_games_won}])
]),
)
])
# Plot title:
fig.update_layout(
title={
'text': "Mean Assist to Turnover Ratio vs. team category,<br>2003-2019. Interactive graph.",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
family='sans-serif',
color="#000"),
margin=dict(t=150) # margin between title and plot
)
fig.show(renderer="kaggle")
fig_6 = go.Figure(fig) # to show the same fig in the Results section
palette = [sns.color_palette("cubehelix", 10).as_hex()[6], sns.color_palette("cubehelix", 10).as_hex()[1], 'gold']
label_colors_a = [palette[0], palette[0], palette[0]]
label_colors_b = [palette[1], palette[1], palette[1]]
label_colors_c = [palette[2], palette[2], palette[2]]
fig = make_subplots(rows=2, cols=1, subplot_titles=("Regular season", "Tournaments"),
shared_xaxes=True, vertical_spacing = 0.15)
row = 1 # row nr. for subplot
for df, won_df in zip(detailed_results_dfs, winning_dfs): # Make plots for both regular season and tournaments:
i = 0 # counter for labels
for label in order: # 'Ordinary', 'Cinderella', 'Top'
plot_df = df[df.LABEL == label] # all games
won_plot_df = won_df[won_df.LABEL == label] # games won
# All games - visible:
colors = [label_colors_a, label_colors_b, label_colors_c]
scores = ['DR <br>', 'STL <br>', 'BLK <br>'] # text to show
# Move common code to a function to reuse multiple times:
def plot_bar(df, scores, colors, visible):
x_list = [df['DR'].mean(), df['Stl'].mean(), df['Blk'].mean()]
for x, score, bar_colors in zip(x_list, scores, colors): # 'bar_colors' avoids shadowing the outer 'colors'
fig.add_trace(
go.Bar(x=[x], # just one number value for a bar
y=[label],
name=label,
marker_color=bar_colors[i],
visible=visible,
opacity=1,
orientation='h',
text=(x),
textposition='inside',
texttemplate=score + '%{text:.2f}'),
row=row, col=1)
plot_bar(plot_df, scores, colors, True)
# Games won - not visible by default:
plot_bar(won_plot_df, scores, colors, False)
i+=1
row+=1 # go to next subplot
fig.update_layout(barmode='stack')
# Controlling text fontsize with uniformtext
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='show')
fig.update_layout(showlegend=False, # hide legend
width=plotly_width, height=550) # set size
# Set axis font:
fig.update_yaxes(tickfont=dict(size=14))
# Add titles:
fig.update_xaxes(title_text='DR - defensive rebounds | STL - steals | BLK - blocks', row=2, col=1)
### BUTTONS ###
# Default visibility:
one_label_visibility = [[True]*3, [False]*3] # all-games bars, then games-won bars, for one label
one_subplot_visibility = one_label_visibility*3 # all bars for one subplot (three labels)
show_all_games = sum(one_subplot_visibility*2, []) # all bars for both subplots, flattened into one list
# Opposite visibility (reverse above list):
show_games_won = show_all_games[::-1]
# Add buttons:
fig.update_layout(
updatemenus=[
dict(
type="buttons",
direction="right",
active=0,
x=0.6,
y=1.2,
buttons=list([
dict(label="All games",
method="restyle",args=[{"visible": show_all_games}]),
dict(label="Games won",
method="restyle",args=[{"visible": show_games_won}])
]),
)
])
### END BUTTONS ###
# Plot title:
fig.update_layout(
title={
'text': "Mean defence statistics per game vs. team category,<br>2003-2019. Interactive graph.",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
family='sans-serif',
color="#000"),
margin=dict(t=150) # margin between title and plot
)
fig.show(renderer="kaggle")
fig_7 = go.Figure(fig) # to show the same fig in the Results section
print_distribution_comments("DR")
print("\n***")
print_distribution_comments("Stl")
print("\n***")
print_distribution_comments("Blk")
label_colors = sns.color_palette("colorblind").as_hex()[0:3]
fig = make_subplots(rows=2, cols=1, subplot_titles=("Regular season", "Tournaments"),
shared_xaxes=True, vertical_spacing = 0.15)
row = 1 # row nr. for subplot
for df, won_df in zip(detailed_results_dfs, winning_dfs): # Make plots for both regular season and tournaments:
i = 0 # counter for labels
for label in order: # 'Ordinary', 'Cinderella', 'Top'
plot_df = df[df.LABEL == label] # all games
won_plot_df = won_df[won_df.LABEL == label] # games won
# Move common code to a function to reuse multiple times:
def plot_bar(df, visible):
background_x = df['PF'].mean()
front_x = df['Blk'].mean()
# Background layer:
fig.add_trace(
go.Bar(x=[background_x], # just one value for a bar
y=[label],
name=label,
marker_color=label_colors[i],
visible=visible,
opacity=0.5,
orientation='h'),
row=row, col=1)
# Front layer:
fig.add_trace(
go.Bar(x=[front_x], # just one value for a bar
y=[label],
name=label,
marker_color=label_colors[i],
visible=visible,
orientation='h',
text=(front_x/background_x), # calculate the ratio
textposition='auto',
texttemplate='%{text:.1%}'), # format output
row=row, col=1)
# All games:
plot_bar(plot_df, True)
# Games won:
plot_bar(won_plot_df, False)
i+=1
row+=1 # go to next subplot
fig.update_layout(barmode='overlay') # the bars are plotted over one another
# Default visibility:
show_all_games = [True, True, False, False, # 'Ordinary'
True, True, False, False, # 'Cinderella'
True, True, False, False, # 'Top'
True, True, False, False, # 'Ordinary'
True, True, False, False, # 'Cinderella'
True, True, False, False] # 'Top'
# Opposite visibility (reverse above list):
show_games_won = show_all_games[::-1]
fig.update_layout(showlegend=False, # hide legend
width=plotly_width, height=550) # set size
# Set axis font:
fig.update_yaxes(tickfont=dict(size=14))
# Add titles:
fig.update_xaxes(title_text='Mean blocks / personal fouls', row=2, col=1)
# Add buttons:
fig.update_layout(
updatemenus=[
dict(
type="buttons",
direction="right",
active=0,
x=0.6,
y=1.2,
buttons=list([
dict(label="All games",
method="restyle",args=[{"visible": show_all_games}]),
dict(label="Games won",
method="restyle",args=[{"visible": show_games_won}])
]),
)
])
# Plot title:
fig.update_layout(
title={
'text': "Mean blocks per fouls vs. team category,<br>2003-2019. Interactive graph.",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
family='sans-serif',
color="#000"),
margin=dict(t=150) # margin between title and plot
)
fig.show(renderer="kaggle")
fig_8 = go.Figure(fig) # to show the same fig in the Results section
print_distribution_comments("PF")
fig = make_subplots(rows=2, cols=1, subplot_titles=("Regular season", "Tournaments"),
shared_xaxes=True, vertical_spacing = 0.15)
row = 1 # row nr. for subplot
for df, won_df in zip(detailed_results_dfs, winning_dfs): # Make plots for both regular season and tournaments:
i = 0
for label in order: # 'Ordinary', 'Cinderella', 'Top'
plot_df = df[df.LABEL == label] # all games
won_plot_df = won_df[won_df.LABEL == label] # games won
# All games (visible):
fig.add_trace(
go.Box(x=plot_df['Rebound Margin'],
name=label,
marker_color=sns.color_palette("colorblind").as_hex()[i],
boxmean=True, # represent mean
boxpoints='suspectedoutliers',
visible=True), # only suspected outliers
row=row, col=1)
# Games won (not visible by default):
fig.add_trace(
go.Box(x=won_plot_df['Rebound Margin'],
name=label,
marker_color=sns.color_palette("colorblind").as_hex()[i],
boxmean=True, # represent mean
boxpoints='suspectedoutliers', # only suspected outliers
visible=False),
row=row, col=1)
i+=1
row+=1 # go to next subplot
# Add vertical line to represent zero Rebound Margin:
fig.update_layout(
shapes=[
dict(type="line", xref="x1", yref="y1", # col 1, row 1
x0=0, y0=-1, x1=0, opacity=0.5,
line=dict(dash='dash', color='grey')),
dict(type="line", xref="x1", yref="y2", # col 1, row 2
x0=0, y0=-1, x1=0, opacity=0.5,
line=dict(dash='dash', color='grey'))])
fig.update_layout(showlegend=False, # hide legend
width=plotly_width, height=750) # set size
# Set axis font:
fig.update_yaxes(tickfont=dict(size=14))
# Add titles:
fig.update_xaxes(title_text='Rebound Margin per game', row=2, col=1)
# Default visibility:
show_all_games = [True, False, True, False, True, False, # row 1: OO CC TT ('Ordinary', 'Cinderella', 'Top')
True, False, True, False, True, False] # row 2: OO CC TT ('Ordinary', 'Cinderella', 'Top')
# Opposite visibility (reverse above list):
show_games_won = show_all_games[::-1]
# Add buttons:
fig.update_layout(
updatemenus=[
dict(
type="buttons",
direction="right",
active=0,
x=0.6,
y=1.2,
buttons=list([
dict(label="All games",
method="restyle",args=[{"visible": show_all_games}]),
dict(label="Games won",
method="restyle",args=[{"visible": show_games_won}])
]),
)
])
# Plot title:
fig.update_layout(
title={
'text': "Rebound Margin (distribution) vs. team category,<br>2003-2019. Interactive graph.",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
family='sans-serif',
color="#000"),
margin=dict(t=180) # margin between title and plot
)
fig.show(renderer="kaggle")
fig_10 = go.Figure(fig) # to show the same fig in the Results section
print_distribution_comments('Rebound Margin')
Data Section
6 files: MEvents2015.csv, MEvents2016.csv, MEvents2017.csv, MEvents2018.csv, MEvents2019.csv, MEvents2020.csv - each MEvents file lists the play-by-play event logs for more than 99.5% of games from that season. Each event is assigned to either a team or a single one of the team's players.
EventTeamID - this is the ID of the team that the event is logged for, which will either be the WTeamID or the LTeamID [1].
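To make this relationship concrete, here is a minimal sketch of the invariant with toy rows (the team IDs are hypothetical, not from the dataset):

```python
import pandas as pd

# Toy play-by-play rows illustrating the invariant:
# EventTeamID always equals either WTeamID or LTeamID of that game.
events = pd.DataFrame({
    "WTeamID":     [1104, 1104, 1272],
    "LTeamID":     [1272, 1272, 1104],
    "EventTeamID": [1104, 1272, 1272],
})
valid = (events["EventTeamID"] == events["WTeamID"]) | (events["EventTeamID"] == events["LTeamID"])
valid.all()  # True for a consistent log
```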
events_dir = '/kaggle/input/march-madness-analytics-2020/MPlayByPlay_Stage2/'
if sys.executable != '/opt/conda/bin/python':
# remove the forward slash if running this notebook locally:
events_dir = events_dir[1:]
def load_events_file(name):
    '''Load the CSV file and show basic info'''
    print("File: {}".format(name))
    df = pd.read_csv(events_dir + name + '.csv')
    print("Num rows: {}".format(len(df)))
    print("NaN values: {}".format(df.isna().sum().sum()))
    print("Duplicated rows: {}".format(df.duplicated().sum()))
    print(list(df))
    print("\n")
    return df
Load in the files:
MEvents2015 = load_events_file('MEvents2015')
MEvents2016 = load_events_file('MEvents2016')
MEvents2017 = load_events_file('MEvents2017')
MEvents2018 = load_events_file('MEvents2018')
MEvents2019 = load_events_file('MEvents2019')
MEvents2020 = load_events_file('MEvents2020')
MEvents2015
# Just a test:
pd.concat([MEvents2020.head(3), MEvents2020.tail(2)])
Make one common MEvents dataframe via concatenating 6 files together
MEvents = pd.concat([MEvents2015, MEvents2016, MEvents2017, MEvents2018, MEvents2019, MEvents2020],
sort=False, ignore_index=True)
print("Play-by-play event logs (all 15835846 logged events):")
MEvents
Make a separate dataframe for labeled NCAA® tournament events
len(labeled_MNCAATourneyCompactResults[labeled_MNCAATourneyCompactResults.Season.isin([2015,2016,2017,2018,2019])])
%%time
cols = ['Season', 'DayNum', 'WTeamID', 'LTeamID']
labeled_tourney_MEvents = MEvents.join(labeled_MNCAATourneyCompactResults.set_index(cols), on=cols, how='inner')
print(len(labeled_tourney_MEvents.groupby(cols).sum())) # should be 335
labeled_tourney_MEvents
min(labeled_tourney_MEvents.DayNum), max(labeled_tourney_MEvents.DayNum) # just a test
Are missing values in X, Y encoded as zeros?
fig, ax = plt.subplots(1,2, figsize = (14, 6))
ax[0].hist(MEvents['X'], bins=20)
ax[1].hist(MEvents[MEvents['X'] != 0]['X'], bins=20)
ax[0].set_title("Including zeros")
ax[1].set_title("Without zeros")
plt.suptitle("Distribution of X coordinate values in MEvents dataframe.", y=1.05)
plt.show()
Get rid of zero coordinates
Select rows where "X" is not zero:
court_MEvents = MEvents[MEvents.X != 0]
court_labeled_tourney_MEvents = labeled_tourney_MEvents[labeled_tourney_MEvents['X'] != 0]
What events are available with coordinates?
# What events are available with coordinates?
court_MEvents.EventType.value_counts()
Which season data has events with coordinates?
court_MEvents.Season.value_counts()
# Court outline (image source: [25])
line_img = plt.imread("https://raw.githubusercontent.com/evanca/data-analysis_kaggle_march-madness-analytics-2020/master/img/Vve3bT9.png")
df = court_MEvents[court_MEvents.EventType == 'made3']
fig, ax = plt.subplots(figsize=(14,7.5))
sns.kdeplot(df['X'], df['Y'], shade=True, cmap='Reds',
n_levels=25, alpha=1).set(xlim=(0, 100), ylim=(0, 100))
ax.imshow(line_img, extent=[0, 100, 0, 100], aspect='auto', alpha=0.15, zorder=10)
# Remove coordinates:
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.title('Three-point goal heatmap,\n2019-2020.\n')
plt.show()
print("\n Area nr. vs. three-point goals:")
df = pd.DataFrame(court_MEvents[court_MEvents.EventType == 'made3']['Area'].value_counts())
df['Share'] = df['Area'] / sum(df['Area'])
df.columns=['Sum', 'Share']
df = df.style.format({'Share': "{:.2%}"})
display(df)
print("\n9 = outside right\n10 = outside center\n11 = outside left")
df = court_MEvents[(court_MEvents.EventType == 'turnover') &
(court_MEvents.Area.isin([8,9,10,11,12]))]
fig, ax = plt.subplots(figsize=(14,7.5))
sns.kdeplot(df['X'], df['Y'], shade=True, cmap='Blues',
n_levels=25, alpha=1).set(xlim=(0, 100), ylim=(0, 100))
ax.imshow(line_img, extent=[0, 100, 0, 100], aspect='auto', alpha=0.15, zorder=10)
# Remove coordinates:
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.title('Turnover heatmap beyond the three-point line,\n2019-2020.\n')
plt.show()
print("\n Area nr. vs. turnovers:")
df = pd.DataFrame(court_MEvents[(court_MEvents.EventType == 'turnover')&
(court_MEvents.Area.isin([8,9,10,11,12]))]['Area'].value_counts())
df['Share'] = df['Area'] / sum(df['Area'])
df.columns=['Sum', 'Share']
df = df.style.format({'Share': "{:.2%}"})
display(df)
print("\n9 = outside right\n10 = outside center\n11 = outside left")
Make a final image to show in the Results section:
fig = plt.figure(figsize = (14, 15))
### Plot 1 ###
ax1 = fig.add_axes([0, 0.5, 0.8, 0.4]) # [left, bottom, width, height]
df = court_MEvents[court_MEvents.EventType == 'made3']
sns.kdeplot(df['X'], df['Y'], shade=True, cmap='Reds',
n_levels=25, alpha=1, ax=ax1).set(xlim=(0, 100), ylim=(0, 100))
ax1.imshow(line_img, extent=[0, 100, 0, 100], aspect='auto', alpha=0.15, zorder=10)
# Remove coordinates:
ax1.get_xaxis().set_visible(False)
ax1.get_yaxis().set_visible(False)
ax1.set_title('Figure 5. Three-point goal heatmap,\n2018-19 and 2019-20.\n')
### Plot 2 ###
ax2 = fig.add_axes([0, 0, 0.8, 0.4]) # [left, bottom, width, height]
df = court_MEvents[(court_MEvents.EventType == 'turnover') &
(court_MEvents.Area.isin([8,9,10,11,12]))]
sns.kdeplot(df['X'], df['Y'], shade=True, cmap='Blues',
n_levels=25, alpha=1, ax=ax2).set(xlim=(0, 100), ylim=(0, 100))
ax2.imshow(line_img, extent=[0, 100, 0, 100], aspect='auto', alpha=0.15, zorder=10)
# Remove coordinates:
ax2.get_xaxis().set_visible(False)
ax2.get_yaxis().set_visible(False)
ax2.set_title('Figure 6. Turnover heatmap beyond the three-point line,\n2018-19 and 2019-20.\n')
save_plot()
plt.show()
binned_tourney_MEvents = labeled_tourney_MEvents.copy() # copy, so the new 'bin' column doesn't mutate the original
# Create interval for bins
interval_range = pd.interval_range(start=0, freq=300, end=binned_tourney_MEvents['ElapsedSeconds'].max())
# Create a bin column
binned_tourney_MEvents['bin'] = pd.cut(labeled_tourney_MEvents['ElapsedSeconds'], interval_range)
assert len(binned_tourney_MEvents) == len(labeled_tourney_MEvents)
binned_tourney_MEvents.sample(3)
# The same but without Nulls:
len(binned_tourney_MEvents[binned_tourney_MEvents['bin'].notna()])
Make a column with minute values (instead of seconds):
# Clean the data: remove Null bins
binned_tourney_MEvents = binned_tourney_MEvents[binned_tourney_MEvents['bin'].notna()]
# Make a column with minute values (instead of seconds)
binned_tourney_MEvents['bin'] = binned_tourney_MEvents['bin'].astype(str)
binned_tourney_MEvents = pd.concat([binned_tourney_MEvents, binned_tourney_MEvents['bin'].str.split(', ', expand=True)], axis=1)
binned_tourney_MEvents.sample(3)
Format "bin" column:
# Format "bin" column to "nr - nr" output:
binned_tourney_MEvents[0] = (binned_tourney_MEvents[0].str.extract(r'(\d+)').astype(int)/60).astype(int) # extract numbers and convert to minutes
binned_tourney_MEvents[1] = (binned_tourney_MEvents[1].str.extract(r'(\d+)').astype(int)/60).astype(int)
binned_tourney_MEvents = binned_tourney_MEvents.sort_values(0) # sort by bins
binned_tourney_MEvents['bin'] = binned_tourney_MEvents[0].astype(str) + " - " + binned_tourney_MEvents[1].astype(str)
binned_tourney_MEvents.sample(3)
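As an aside, the string parsing above can be avoided: pandas Interval objects expose their numeric bounds directly. A minimal sketch with toy second values:

```python
import pandas as pd

secs = pd.Series([12, 310, 1199, 2390])
bins = pd.cut(secs, pd.interval_range(start=0, freq=300, end=2400))
# Build "minute - minute" labels from the numeric bounds, no string splitting needed:
labels = [f"{int(iv.left // 60)} - {int(iv.right // 60)}" for iv in bins]
labels  # ['0 - 5', '5 - 10', '15 - 20', '35 - 40']
```

The same `.left`/`.right` accessors appear later in this notebook via `attrgetter('right')`.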
Calculate total field goals vs. elapsed time
# Count EventID cells for each column:
field_goals = binned_tourney_MEvents[(binned_tourney_MEvents['EventType'] == 'made2') |
(binned_tourney_MEvents['EventType'] == 'made3')][['Season', 'EventID',
'bin']].groupby(['Season', 'bin'],
as_index=False,
sort=False).count()
field_goals.head()
# Create figure
fig = go.Figure()
# Add traces, one for each slider step
all_seasons = [2015, 2016, 2017, 2018, 2019]
# First trace - step [0] with all data:
df = field_goals.groupby('bin', as_index=False, sort=False).sum()
y=df['EventID']
# Different color for the biggest column:
color=np.array([sns.color_palette("cubehelix", 10).as_hex()[6]]*y.shape[0])
color[y < max(y)]=sns.color_palette("cubehelix", 10).as_hex()[5]
fig.add_trace(
go.Bar(
visible=False,
x=df['bin'],
y=y,
marker_color=color.tolist(),
text=(y),
textposition='outside'))
# Next 5 steps by season:
for season in all_seasons:
df = field_goals[field_goals.Season == season]
y=df['EventID']
# Different color for the biggest column:
color=np.array([sns.color_palette("cubehelix", 10).as_hex()[6]]*y.shape[0])
color[y < max(y)]=sns.color_palette("cubehelix", 10).as_hex()[5]
fig.add_trace(
go.Bar(
visible=False,
x=df['bin'],
y=y,
marker_color=color.tolist(),
text=(y),
textposition='outside'))
# Make 0th trace visible
fig.data[0].visible = True
### ADD SLIDER ###
steps = []
step_labels = ['ALL<br>(2015-2019)', '2015', '2016', '2017', '2018', '2019']
for i in range(len(fig.data)):
step = dict(
label=step_labels[i],
method="restyle",
args=["visible", [False] * len(fig.data)],
)
step["args"][1][i] = True # Toggle i'th trace to "visible"
steps.append(step)
sliders = [dict(
active=0,
currentvalue={"prefix": "Season: "},
pad={"t": 50},
steps=steps
)]
fig.update_layout(
sliders=sliders
)
### END SLIDER ###
fig.update_layout(showlegend=False, # hide legend
width=plotly_width, height=750) # set size
# Set axis font:
fig.update_yaxes(tickfont=dict(size=14))
# Add titles:
fig.update_xaxes(title_text='Elapsed time, minutes')
fig.update_yaxes(title_text='Total goals')
# Plot title:
fig.update_layout(
title={
'text': "Field goals scored vs. elapsed time,<br>2015-2019 NCAA® tournaments. Interactive graph.",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
family='sans-serif',
color="#000")
)
fig.show(renderer="kaggle")
fig_1 = go.Figure(fig) # to show the same fig in the Results section
from matplotlib.lines import Line2D
colors = [sns.color_palette("cubehelix", 10)[6], sns.color_palette("cubehelix", 10)[1], 'gold']
fig, ax = plt.subplots(figsize=(14,7.5))
# Show background image:
ax.imshow(line_img, extent=[0, 100, 0, 100], aspect='auto', alpha=0.5)
sns.scatterplot(x='X', y='Y', data=court_labeled_tourney_MEvents[court_labeled_tourney_MEvents.EventType == 'miss2'],
alpha=0.35,
edgecolor=None,
color=colors[1]).set(xlim=(0, 100), ylim=(0, 100))
sns.scatterplot(x='X', y='Y', data=court_labeled_tourney_MEvents[court_labeled_tourney_MEvents.EventType == 'made2'],
alpha=0.35,
edgecolor=None,
color=colors[0]).set(xlim=(0, 100), ylim=(0, 100))
sns.scatterplot(x='X', y='Y', data=court_labeled_tourney_MEvents[court_labeled_tourney_MEvents.EventType == 'miss3'],
alpha=0.35,
edgecolor=None,
color=colors[1]).set(xlim=(0, 100), ylim=(0, 100))
sns.scatterplot(x='X', y='Y', data=court_labeled_tourney_MEvents[court_labeled_tourney_MEvents.EventType == 'made3'],
alpha=0.35,
edgecolor=None,
color=colors[0]).set(xlim=(0, 100), ylim=(0, 100))
ax = plt.gca()
ax.legend(handles=[(Line2D([0],[0], marker='o', markerfacecolor=colors[0],
linestyle='none', markersize=10, markeredgecolor='none')),
(Line2D([0],[0], marker='o', markerfacecolor=colors[1],
linestyle='none', markersize=10, markeredgecolor='none'))],
labels=["goal made", "goal missed"], loc="upper center")
# Remove coordinate values:
ax = plt.gca()
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.title("Figure 4. Field goal accuracy by player location,\n2015-2019 NCAA® tournaments.\n")
plt.show()
Calculate goal distance in m based on X, Y coordinates
The court is 15.2 m (50 ft) wide and 28.7 m (94 ft) long, mapped onto a 100 x 100 coordinate grid. From this we can calculate the size of one grid "square": 0.287 m along the length (X) by 0.152 m along the width (Y).
2 points are awarded to players who successfully shoot the ball through the hoop from anywhere inside the three-point line. This can be done by shooting a jump shot, laying the ball into the rim, or slamming the ball through the hoop. 3 points are awarded to players who successfully shoot the ball through the hoop from behind the three-point line [26].
Note that we assume all 3-point goals were made on the same half of the court as the basket being attacked. While this holds for the vast majority of shots, a few insignificant errors are possible (a three-pointer launched from the other half of the court).
Our approach in code:
# Filter out all field goal rows for NCAA® tournaments:
goals = court_labeled_tourney_MEvents[court_labeled_tourney_MEvents['EventType'].isin(['made2', 'miss2', 'made3', 'miss3'])].copy() # copy, so new columns can be added safely
goals
What are the coordinates of each basket?
# Right basket:
goals[(goals['Area'] == 1) & (goals['X'] > 50)][["X", "Y"]].mean().round()
From this result (roughly X=93, Y=50 for the right basket), we can tell by symmetry that the coordinates of the left basket are X=7, Y=50.
Next, convert our coordinates to meters (as if our court was divided into 1x1m grid):
goals['XMeters'] = goals['X']*0.287
goals['YMeters'] = goals['Y']*0.152
goals.sample(3)
Basket coordinates in meters:
# Basket coordinates in meters:
right_basket_m = (93*0.287, 50*0.152)
left_basket_m = (7*0.287, 50*0.152)
print(right_basket_m, left_basket_m)
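As a quick sanity check before the full computation, a worked example with a hypothetical shot location (grid position (25, 70) is made up for illustration):

```python
import math

# Court: 28.7 m along X, 15.2 m along Y, mapped onto a 100x100 grid,
# so one grid unit is 0.287 m in X and 0.152 m in Y.
left_basket_m = (7 * 0.287, 50 * 0.152)

# Hypothetical shot taken from grid position (25, 70):
shot_m = (25 * 0.287, 70 * 0.152)
distance = math.hypot(shot_m[0] - left_basket_m[0], shot_m[1] - left_basket_m[1])
round(distance, 2)  # 5.99 m, just inside the 6.32 m (20 ft 9 in) three-point arc
```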
Calculate the distance
%%time
import math
def calculate_distance(x1,y1, x2,y2):
'''Calculate distance between two points'''
dist = math.sqrt((x2 - x1)**2 + (y2 - y1)**2)
return dist
goals['ShotDistanceMeters'] = None
len_goals = len(goals)
for i, (index, row) in enumerate(goals.iterrows(), start=1):
    if row['X'] > 50: # right basket
        goals.loc[index, 'ShotDistanceMeters'] = calculate_distance(row['XMeters'],
                                                                    row['YMeters'],
                                                                    right_basket_m[0],
                                                                    right_basket_m[1])
    elif row['X'] < 50: # left basket
        goals.loc[index, 'ShotDistanceMeters'] = calculate_distance(row['XMeters'],
                                                                    row['YMeters'],
                                                                    left_basket_m[0],
                                                                    left_basket_m[1])
    print("Updating row {} of {}".format(i, len_goals)
          + " "*100, end="\r", flush=True) # overwrite previous output on the same line
print(" "*100, end="\r", flush=True) # erase final output
goals.sample(3)
goals['ShotDistanceMeters'].isna().sum() # just a test
Split distance to 0.5 m bins:
min(goals['ShotDistanceMeters']), max(goals['ShotDistanceMeters'])
from operator import attrgetter
# Create interval for bins
interval_range = pd.interval_range(start=-0.5, freq=0.5, end=12)
# Create a bin column
goals['DistanceBin'] = pd.cut(goals['ShotDistanceMeters'], interval_range)
# Note we take the right bound of each interval (a square bracket means the bound is inclusive)
goals['DistRightBound'] = goals['DistanceBin'].map(attrgetter('right'))
goals.sample(3)
Calculate shot accuracy per distance in meters
df = goals.groupby(['DistRightBound', 'LABEL', 'EventType'], as_index=False)['EventID'].count()
df
One-hot encode "made" or "miss":
dummies = pd.get_dummies(df['EventType']) # what type of event
for col in dummies:
df[col] = dummies[col]*df['EventID'] # how many of such events
df[~df.EventID.isna()]
Calculate field-goal percentage (goal accuracy in %)
shooting_accuracy_df = df.groupby(['DistRightBound'], as_index = False).sum()
shooting_accuracy_df.sample(3)
labeled_shooting_accuracy_df = df.groupby(['DistRightBound', 'LABEL'], as_index = False).sum()
labeled_shooting_accuracy_df.sample(3)
# Calculate field-goal percentage (goal accuracy in %):
for df in [shooting_accuracy_df, labeled_shooting_accuracy_df]:
df['GoalsScored'] = df['made2'] + df['made3']
df['GoalAccuracy'] = df['GoalsScored'] / df['EventID']
shooting_accuracy_df.sample(3)
Make final image:
import matplotlib.ticker as mtick
fig = plt.figure(figsize = (14, 20))
### Plot 1 ###
ax1 = fig.add_axes([0, 0.4, 0.8, 0.29]) # [left, bottom, width, height]
sns.lineplot(x='DistRightBound', y='GoalAccuracy', data=shooting_accuracy_df, color='crimson', ax=ax1)
ax1.set_xlabel("Shot distance in meters")
ax1.set_ylabel("Shooting accuracy in %")
# Y ticks as percentages:
ax1.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
# Show 3 point line:
ax1.axvline(x=6.3246, color='grey', linestyle='--') # 20 feet, 9 inches
ax1.annotate('three-point line',
xy=(6.37, 0.7), xycoords='data',
xytext=(-100, -50), textcoords='offset points')
# Add secondary axis to also show feet distance:
def m2feet(x):
return x * 3.28084
def feet2m(x):
return x / 3.28084
secax = ax1.secondary_xaxis('top', functions=(m2feet, feet2m))
secax.set_xlabel('Shot distance in feet')
ax1.set_title("Figure 3. Field-goal shooting accuracy % by distance,\n2015-2019 NCAA® tournaments.\n")
### Plot 2 ###
ax2 = fig.add_axes([0, 0, 0.8, 0.29]) # [left, bottom, width, height]
# Show background image:
ax2.imshow(line_img, extent=[0, 100, 0, 100], aspect='auto', alpha=0.5)
sns.scatterplot(x='X', y='Y', data=court_labeled_tourney_MEvents[court_labeled_tourney_MEvents.EventType == 'miss2'],
alpha=0.35,
edgecolor=None,
color=colors[1], ax=ax2).set(xlim=(0, 100), ylim=(0, 100))
sns.scatterplot(x='X', y='Y', data=court_labeled_tourney_MEvents[court_labeled_tourney_MEvents.EventType == 'made2'],
alpha=0.35,
edgecolor=None,
color=colors[0], ax=ax2).set(xlim=(0, 100), ylim=(0, 100))
sns.scatterplot(x='X', y='Y', data=court_labeled_tourney_MEvents[court_labeled_tourney_MEvents.EventType == 'miss3'],
alpha=0.35,
edgecolor=None,
color=colors[1], ax=ax2).set(xlim=(0, 100), ylim=(0, 100))
sns.scatterplot(x='X', y='Y', data=court_labeled_tourney_MEvents[court_labeled_tourney_MEvents.EventType == 'made3'],
alpha=0.35,
edgecolor=None,
color=colors[0], ax=ax2).set(xlim=(0, 100), ylim=(0, 100))
ax2.legend(handles=[(Line2D([0],[0], marker='o', markerfacecolor=colors[0],
linestyle='none', markersize=10, markeredgecolor='none')),
(Line2D([0],[0], marker='o', markerfacecolor=colors[1],
linestyle='none', markersize=10, markeredgecolor='none'))],
labels=["goal made", "goal missed"], loc="upper center")
# Remove coordinate values:
ax2.get_xaxis().set_visible(False)
ax2.get_yaxis().set_visible(False)
ax2.set_title("\nFigure 4. Field goal accuracy by player location,\n2015-2019 NCAA® tournaments.\n")
save_plot()
plt.show()
sns.lineplot(x='DistRightBound', y='GoalAccuracy', data=labeled_shooting_accuracy_df, hue="LABEL", hue_order=order)
plt.xlabel("Shot distance in meters")
plt.ylabel("Shooting accuracy in %")
ax = plt.gca()
handles, labels = ax.get_legend_handles_labels()
plt.legend(handles=handles[1:], labels=labels[1:])
# Y ticks as percentages:
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
# Show 3 point line:
plt.axvline(x=6.3246, color='grey', linestyle='--') # 20 feet, 9 inches
plt.gca().annotate('three-point line',
xy=(6.37, 0.7), xycoords='data',
xytext=(-100, -50), textcoords='offset points')
# Add secondary axis to also show feet distance:
def m2feet(x):
return x * 3.28084
def feet2m(x):
return x / 3.28084
secax = plt.gca().secondary_xaxis('top', functions=(m2feet, feet2m))
secax.set_xlabel('Shot distance in feet')
plt.title("Figure 12. Field-goal shooting accuracy % by distance per team category,\n2015-2019 NCAA® tournaments.\n")
save_plot()
plt.show()
print("Descriptive numbers for file nr. {} (Cinderella):".format(str(file_nr-1)))
cinderella = labeled_shooting_accuracy_df[labeled_shooting_accuracy_df.LABEL == 'Cinderella']
print('\nGoalAccuracy < 20%:')
print(cinderella[cinderella.GoalAccuracy < 0.2][['DistRightBound', 'GoalAccuracy']])
print('\nGoalAccuracy > 60%:')
print(cinderella[cinderella.GoalAccuracy > 0.6][['DistRightBound', 'GoalAccuracy']])
MPlayers_file = "/kaggle/input/march-madness-analytics-2020/MPlayByPlay_Stage2/MPlayers.csv"
if sys.executable != '/opt/conda/bin/python':
# remove the forward slash if running this notebook locally:
MPlayers_file = MPlayers_file[1:]
MPlayers = pd.read_csv(MPlayers_file)
print("Num rows: {}".format(len(MPlayers)))
print("NaN values: {}".format(MPlayers.isna().sum().sum()))
print("Duplicated rows: {}".format(MEvents2019.duplicated().sum()))
pd.concat([MPlayers.head(3), MPlayers.tail(2)])
Add player's last and first names to our labeled tourney MEvents data
print(len(labeled_tourney_MEvents))
list(MPlayers)
MPlayers.rename({'PlayerID': 'EventPlayerID',
'TeamID': 'EventTeamID'}, axis=1, inplace=True)
cols = ['EventPlayerID', 'EventTeamID']
players_labeled_tourney_MEvents = labeled_tourney_MEvents.join(MPlayers.set_index(cols), on=cols)
assert len(players_labeled_tourney_MEvents) == len(labeled_tourney_MEvents)
players_labeled_tourney_MEvents
One-hot encode event type
dummies = pd.get_dummies(players_labeled_tourney_MEvents['EventType']) # what type of event
assert len(dummies) == len(players_labeled_tourney_MEvents)
players_labeled_tourney_MEvents = pd.concat([players_labeled_tourney_MEvents, dummies], axis=1)
players_labeled_tourney_MEvents
Calculate offensive efficiency per season
Offensive efficiency = (FGM + A) / (FGA - OREB + A + TO) [7], where FGM = field goals made, A = assists, FGA = field goals attempted, OREB = offensive rebounds, TO = turnovers.
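A minimal sketch of this formula with toy numbers (the values are hypothetical, not from the dataset):

```python
def offensive_efficiency(fgm, assists, fga, oreb, turnovers):
    """OE = (FGM + A) / (FGA - OREB + A + TO); None when the denominator is not positive."""
    denominator = fga - oreb + assists + turnovers
    if denominator <= 0:
        return None
    return round((fgm + assists) / denominator, 2)

offensive_efficiency(5, 3, 10, 2, 1)  # (5 + 3) / (10 - 2 + 3 + 1) = 8/12 -> 0.67
```

Guarding against a non-positive denominator mirrors the filtering we apply to the real data below.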
players_labeled_tourney_MEvents['OREB'] = 0 # new column, default value
players_labeled_tourney_MEvents.loc[players_labeled_tourney_MEvents.EventSubType.isin(['off', 'offdb']),
'OREB'] = 1
players_labeled_tourney_MEvents['DREB'] = 0 # new column, default value
players_labeled_tourney_MEvents.loc[players_labeled_tourney_MEvents.EventSubType.isin(['def', 'defdb']),
'DREB'] = 1
players_labeled_tourney_MEvents
players_labeled_tourney_MEvents['FGM'] = 0 # new column, default value
players_labeled_tourney_MEvents['FGA'] = 0 # new column, default value
players_labeled_tourney_MEvents.loc[players_labeled_tourney_MEvents.EventType.isin(['made2', 'made3']),
'FGM'] = 1
players_labeled_tourney_MEvents.loc[players_labeled_tourney_MEvents.EventType.isin(['made2', 'made3', 'miss2', 'miss3']),
'FGA'] = 1
players_labeled_tourney_MEvents
This step could be skipped for OE, but we will use this data later.
# Prepare column names:
dummy_cols = list(dummies)
game_cols = ['Season', 'DayNum', 'WTeamID', 'LTeamID'] # to identify each game
cols = ['LABEL',
'EventPlayerID',
'LastName',
'FirstName'] + dummy_cols + game_cols + ['OREB', 'DREB', 'FGM', 'FGA']
print(cols)
sum_per_game_tourney_MEvents = players_labeled_tourney_MEvents.groupby(game_cols + ['LABEL',
'EventPlayerID',
'LastName',
'FirstName'],
as_index=False).sum()[cols]
print("Play-by-play event logs (tournaments grouped by game and player, sum events):")
sum_per_game_tourney_MEvents
players_season = players_labeled_tourney_MEvents.groupby(['Season',
'LABEL',
'EventPlayerID',
'LastName',
'FirstName'], as_index=False).sum()
print("Play-by-play event logs (tournaments grouped by season and player, sum events):")
players_season
Note that we remove rows with a zero or negative denominator to avoid division by zero and meaningless values.
players_season['OE_numerator'] = players_season['FGM'] + \
players_season['assist']
players_season['OE_denominator'] = players_season['FGA'] - \
players_season['OREB'] + \
players_season['assist'] + \
players_season['turnover']
# Remove division by zero and negative values:
players_season = players_season[(players_season['OE_denominator'] > 0)]
# Calculate the OE:
players_season['OE'] = (players_season['OE_numerator'] / \
players_season['OE_denominator']).round(2)
players_season
players_season.OE.describe()
We noticed that this formula can produce high Offensive Efficiency values even for low-performing players: for example, if both the OE numerator and denominator equal 1, the output is a perfect 1.0, which can lead to misleading interpretations.
Considering that we are only interested in plotting top-OE players, we can fix this by filtering on the statistics used in the OE formula. We will eliminate all players whose summary FGM is at or below the median.
players_season.FGM.describe()
# Eliminate lower 50% of summary FGM per season:
print(len(players_season))
fgm_median = players_season['FGM'].median()
players_season = players_season[(players_season['FGM'] > fgm_median)]
print(len(players_season))
players_season.OE.describe()
sns.lineplot(x='Season', y='OE', data = players_season,
hue='LABEL', hue_order=order, ci=None).set(ylim=(0.45, None))
plt.xlabel("Season")
plt.ylabel(f'Offensive Efficiency\n *players with at least {int(fgm_median) + 1} field goals per season')
ax = plt.gca()
handles, labels = ax.get_legend_handles_labels()
plt.legend(handles=handles[1:], labels=labels[1:])
plt.xticks(np.arange(2015, 2020, 1.0)) # custom x ticks
plt.title("Figure 15. Mean player's* Offensive Efficiency by season per team category,\nNCAA® tournaments.\n")
plt.show()
print("Descriptive statistics for file nr. {}:".format(str(file_nr-1)))
players_season.groupby("LABEL")[['OE']].describe()
This plot will not be included in the Results section.
import matplotlib.patches as patches
from matplotlib.offsetbox import (OffsetImage, AnnotationBbox)
plt.figure(figsize=(14,5))
df = players_season.groupby(['EventPlayerID',
'FirstName',
'LastName'], as_index=False).mean().sort_values('OE', ascending=False)
sns.barplot(y=df["FirstName"][:5] + " " + df["LastName"][:5],
x='OE', data = df[:5], color="#0A6FAC", orient='h')
plt.xlabel("Mean Offensive Efficiency by season")
plt.title("Figure 8. Top offensive players in 5 years,\n2015-2019 NCAA® tournaments.\n")
save_plot()
plt.show()
# Some players have equal names (e.g. Chris Harris Jr.) so we need to see the team list:
for index, row in df[:5].iterrows():
# from this df:
player_id = df['EventPlayerID'][index]
firstname = df['FirstName'][index]
lastname = df['LastName'][index]
# from external dfs:
team_id = MPlayers.loc[MPlayers['EventPlayerID'] == player_id, 'EventTeamID'].values[0]
team_name = MTeams.loc[MTeams['TeamID'] == team_id, 'TeamName'].values[0]
print("{} {}, team {}".format(firstname,
lastname,
team_name))
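Since this name-and-team lookup repeats for every leaderboard below, it could be factored into a small helper (a sketch using toy stand-in frames for the renamed MPlayers and the MTeams data; the IDs and names here are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the renamed MPlayers and the MTeams frames:
players = pd.DataFrame({"EventPlayerID": [101, 102], "EventTeamID": [1104, 1272]})
teams = pd.DataFrame({"TeamID": [1104, 1272], "TeamName": ["Alabama", "Gonzaga"]})

def team_of(player_id, players_df, teams_df):
    """Resolve a player's team name via the player and team lookup tables."""
    team_id = players_df.loc[players_df["EventPlayerID"] == player_id, "EventTeamID"].values[0]
    return teams_df.loc[teams_df["TeamID"] == team_id, "TeamName"].values[0]

team_of(102, players, teams)  # 'Gonzaga'
```

The loops below could then call `team_of(player_id, MPlayers, MTeams)` directly.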
Note that we decided not to include the following plots in the Results section, as we found them not innovative enough, but we will keep them in our study for the reader's independent exploration.
Calculate mean per game stats
We already have summary events per game, now we will group by player and calculate mean:
mean_per_game = sum_per_game_tourney_MEvents.groupby(['EventPlayerID',
'FirstName',
'LastName'], as_index=False).mean()
mean_per_game
plt.figure(figsize=(14,3.5))
df = mean_per_game.sort_values('FGM', ascending=False)
sns.barplot(y=df["FirstName"][:7] + " " + df["LastName"][:7],
x='FGM', data = df[:7], color=colors[1], orient='h')
plt.xlabel("Mean field goals made")
plt.title("Field goals per game: top players,\n2015-2019 NCAA® tournaments.\n")
plt.show()
# Some players have equal names (e.g. Chris Harris Jr.) so we need to see the team list:
for index, row in df[:7].iterrows():
# from this df:
player_id = df['EventPlayerID'][index]
firstname = df['FirstName'][index]
lastname = df['LastName'][index]
# from external dfs:
team_id = MPlayers.loc[MPlayers['EventPlayerID'] == player_id, 'EventTeamID'].values[0]
team_name = MTeams.loc[MTeams['TeamID'] == team_id, 'TeamName'].values[0]
print("{} {}, team {}".format(firstname,
lastname,
team_name))
plt.figure(figsize=(14,6))
df = mean_per_game.sort_values('made3', ascending=False)
sns.barplot(y=df["FirstName"][:12] + " " + df["LastName"][:12],
x='made3', data = df[:12], color=colors[1], orient='h')
plt.xlabel("Mean 3-point field goals made")
plt.title("3-point field goals per game: top players,\n2015-2019 NCAA® tournaments.\n")
plt.show()
# Some players have equal names (e.g. Chris Harris Jr.) so we need to see the team list:
for index, row in df[:12].iterrows():
# from this df:
player_id = df['EventPlayerID'][index]
firstname = df['FirstName'][index]
lastname = df['LastName'][index]
# from external dfs:
team_id = MPlayers.loc[MPlayers['EventPlayerID'] == player_id, 'EventTeamID'].values[0]
team_name = MTeams.loc[MTeams['TeamID'] == team_id, 'TeamName'].values[0]
print("{} {}, team {}".format(firstname,
lastname,
team_name))
plt.figure(figsize=(14,3.5))
df = mean_per_game.sort_values('made1', ascending=False)
sns.barplot(y=df["FirstName"][:7] + " " + df["LastName"][:7],
x='made1', data = df[:7], color=colors[1], orient='h')
plt.xlabel("Mean free throws made")
plt.title("Free throws per game: top players,\n2015-2019 NCAA® tournaments.\n")
plt.show()
# Some players have equal names (e.g. Chris Harris Jr.) so we need to see the team list:
for index, row in df[:7].iterrows():
# from this df:
player_id = df['EventPlayerID'][index]
firstname = df['FirstName'][index]
lastname = df['LastName'][index]
# from external dfs:
team_id = MPlayers.loc[MPlayers['EventPlayerID'] == player_id, 'EventTeamID'].values[0]
team_name = MTeams.loc[MTeams['TeamID'] == team_id, 'TeamName'].values[0]
print("{} {}, team {}".format(firstname,
lastname,
team_name))
plt.figure(figsize=(14,3.5))
df = mean_per_game.sort_values('assist', ascending=False)
sns.barplot(y=df["FirstName"][:7] + " " + df["LastName"][:7],
x='assist', data = df[:7], color=colors[1], orient='h')
plt.xlabel("Mean assists")
plt.title("Assists per game: top players,\n2015-2019 NCAA® tournaments.\n")
plt.show()
# Some players have equal names (e.g. Chris Harris Jr.) so we need to see the team list:
for index, row in df[:7].iterrows():
    # from this df:
    player_id = df['EventPlayerID'][index]
    firstname = df['FirstName'][index]
    lastname = df['LastName'][index]
    # from external dfs:
    team_id = MPlayers.loc[MPlayers['EventPlayerID'] == player_id, 'EventTeamID'].values[0]
    team_name = MTeams.loc[MTeams['TeamID'] == team_id, 'TeamName'].values[0]
    print("{} {}, team {}".format(firstname,
                                  lastname,
                                  team_name))
plt.figure(figsize=(14,3.5))
df = mean_per_game.sort_values('DREB', ascending=False)
sns.barplot(y=df["FirstName"][:7] + " " + df["LastName"][:7],
x='DREB', data = df[:7], color=colors[0], orient='h')
plt.xlabel("Mean defensive rebounds")
plt.title("Defensive rebounds per game: top players,\n2015-2019 NCAA® tournaments.\n")
plt.show()
# Some players have equal names (e.g. Chris Harris Jr.) so we need to see the team list:
for index, row in df[:7].iterrows():
    # from this df:
    player_id = df['EventPlayerID'][index]
    firstname = df['FirstName'][index]
    lastname = df['LastName'][index]
    # from external dfs:
    team_id = MPlayers.loc[MPlayers['EventPlayerID'] == player_id, 'EventTeamID'].values[0]
    team_name = MTeams.loc[MTeams['TeamID'] == team_id, 'TeamName'].values[0]
    print("{} {}, team {}".format(firstname,
                                  lastname,
                                  team_name))
plt.figure(figsize=(14,3.5))
df = mean_per_game.sort_values('block', ascending=False)
sns.barplot(y=df["FirstName"][:7] + " " + df["LastName"][:7],
x='block', data = df[:7], color=colors[0], orient='h')
plt.xlabel("Mean blocks")
plt.title("Blocks per game: top players,\n2015-2019 NCAA® tournaments.\n")
plt.show()
# Some players have equal names (e.g. Chris Harris Jr.) so we need to see the team list:
for index, row in df[:7].iterrows():
    # from this df:
    player_id = df['EventPlayerID'][index]
    firstname = df['FirstName'][index]
    lastname = df['LastName'][index]
    # from external dfs:
    team_id = MPlayers.loc[MPlayers['EventPlayerID'] == player_id, 'EventTeamID'].values[0]
    team_name = MTeams.loc[MTeams['TeamID'] == team_id, 'TeamName'].values[0]
    print("{} {}, team {}".format(firstname,
                                  lastname,
                                  team_name))
plt.figure(figsize=(14,3.5))
df = mean_per_game.sort_values('steal', ascending=False)
sns.barplot(y=df["FirstName"][:7] + " " + df["LastName"][:7],
x='steal', data = df[:7], color=colors[0], orient='h')
plt.xlabel("Mean steals")
plt.title("Steals per game: top players,\n2015-2019 NCAA® tournaments.\n")
plt.show()
# Some players have equal names (e.g. Chris Harris Jr.) so we need to see the team list:
for index, row in df[:5].iterrows():
    # from this df:
    player_id = df['EventPlayerID'][index]
    firstname = df['FirstName'][index]
    lastname = df['LastName'][index]
    # from external dfs:
    team_id = MPlayers.loc[MPlayers['EventPlayerID'] == player_id, 'EventTeamID'].values[0]
    team_name = MTeams.loc[MTeams['TeamID'] == team_id, 'TeamName'].values[0]
    print("{} {}, team {}".format(firstname,
                                  lastname,
                                  team_name))
# First add regular season and tournaments together:
labeled_CompactResults = pd.concat([labeled_MRegularSeasonCompactResults, labeled_MNCAATourneyCompactResults],
ignore_index=True)
cols = ['Season', 'DayNum', 'WTeamID', 'LTeamID']
# Next, add this data to our MEvents:
labeled_MEvents = MEvents.join(labeled_CompactResults.set_index(cols),
on=cols,
how='inner')
labeled_MEvents
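The multi-column join pattern used above (set the key columns as the index of the right frame, then `join` with `on=`) can be sketched on two toy frames; the team IDs and values here are hypothetical, not taken from the dataset:

```python
import pandas as pd

# Events keyed by (Season, WTeamID); results carry a label for the same key pairs.
events = pd.DataFrame({'Season': [2019, 2019, 2018],
                       'WTeamID': [1101, 1102, 1101],
                       'EventType': ['made2', 'made3', 'steal']})
results = pd.DataFrame({'Season': [2019, 2019],
                        'WTeamID': [1101, 1102],
                        'LABEL': ['Top', 'Cinderella']})
cols = ['Season', 'WTeamID']
# how='inner' keeps only events whose (Season, WTeamID) pair appears in results:
labeled = events.join(results.set_index(cols), on=cols, how='inner')
print(labeled)
```

The 2018 event drops out because its key pair has no match, which is the same mechanism that restricts `labeled_MEvents` to labeled games.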
labeled_MEvents['EventSubType'] = labeled_MEvents['EventSubType'].fillna(labeled_MEvents['EventType'])
labeled_MEvents = labeled_MEvents[labeled_MEvents.EventSubType != "unk"]
labeled_MEvents['EventType'].replace({"made1": "free throw made",
"miss1": "free throw missed"}, inplace=True)
labeled_MEvents['EventSubType'].replace({"1of1": "1 of 1",
"1of2": "1 of 2",
"2of2": "2 of 2",
"1of3": "1 of 3",
"2of3": "2 of 3",
"3of3": "3 of 3"}, inplace=True)
labeled_MEvents.sample(3)
labeled_MEvents[labeled_MEvents.EventType.isin(['turnover', 'foul'])].LABEL.value_counts()
fig = go.Figure()
parent_col = 'EventType'
child_col = 'EventSubType'
value_col = 'EventID'
i=0
for label in order:
    center_label = label
    df = labeled_MEvents[labeled_MEvents.LABEL == label]
    df = df[df.EventType.isin(['turnover', 'foul'])]
    df = df.groupby([parent_col, child_col], as_index=False).count()[[parent_col, child_col, value_col]]
    # We need unique ids for repeated labels:
    child_ids = list(df[parent_col] + " - " + df[child_col])
    # Calculate values for parents:
    parent_sums = [df[value_col].sum()]  # first value is the sum of all rows
    for parent in list(df[parent_col].unique()):  # for each parent
        parent_sums.append(df[df[parent_col] == parent][value_col].sum())  # add sum values
    # Show final chart:
    fig.add_trace(go.Sunburst(
        ids=[center_label] + list(df[parent_col].unique()) + child_ids,
        labels=[center_label] + list(df[parent_col].unique()) + list(df[child_col]),
        parents=[""] + [center_label]*df[parent_col].nunique() + list(df[parent_col]),
        values=parent_sums + list(df[value_col]),
        textinfo='label+percent parent',
        branchvalues="total",
        domain=dict(column=i)))
    i += 1
fig.update_layout(
grid= dict(columns=3, rows=1),
margin = dict(t=0, l=0, r=0, b=0),
uniformtext=dict(minsize=10, mode='hide')
)
fig.update_layout(
annotations=[
dict(
x=0.5,
y=0,
showarrow=False,
align="center",
text="<b>pers</b> - personal foul | <b>off</b> - offensive foul<br>\
<b>bpass</b> - bad pass turnover | <b>lostb</b> - lost ball | <b>offen</b> - offensive turnover | \
<b>trav</b> - travelling | <b>other</b> - other type of turnover",
xref="paper",
yref="paper",
font=dict(size=14),
)
])
fig.update_layout(
colorway=["#17344B","#D485AF"],
title={
'text': "Fouls and turnovers by subtype per team category,<br>2015-2019. Interactive graph.",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
family='sans-serif',
color="#000")
)
fig.show(renderer="kaggle")
fig_9 = go.Figure(fig) # to show the same fig in the Results section
labeled_MEvents[labeled_MEvents.EventType.isin(['free throw made', 'free throw missed'])].LABEL.value_counts()
fig = go.Figure()
parent_col = 'EventType'
child_col = 'EventSubType'
value_col = 'EventID'
i=0
for label in order:
    center_label = label
    df = labeled_MEvents[labeled_MEvents.LABEL == label]
    df = df[df.EventType.isin(['free throw made', 'free throw missed'])]
    df = df.groupby([parent_col, child_col], as_index=False).count()[[parent_col, child_col, value_col]]
    # We need unique ids for repeated labels:
    child_ids = list(df[parent_col] + " - " + df[child_col])
    # Calculate values for parents:
    parent_sums = [df[value_col].sum()]  # first value is the sum of all rows
    for parent in list(df[parent_col].unique()):  # for each parent
        parent_sums.append(df[df[parent_col] == parent][value_col].sum())  # add sum values
    # Show final chart:
    fig.add_trace(go.Sunburst(
        ids=[center_label] + list(df[parent_col].unique()) + child_ids,
        labels=[center_label] + list(df[parent_col].unique()) + list(df[child_col]),
        parents=[""] + [center_label]*df[parent_col].nunique() + list(df[parent_col]),
        values=parent_sums + list(df[value_col]),
        textinfo='label+percent parent',
        branchvalues="total",
        domain=dict(column=i)))
    i += 1
fig.update_layout(
grid= dict(columns=3, rows=1),
margin = dict(t=0, l=0, r=0, b=0),
uniformtext=dict(minsize=10, mode='hide')
)
fig.update_layout(
colorway=["#0173B2","#DC143C"],
title={
'text': "Free throw attempts per team category,<br>2015-2019. Interactive graph.",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
family='sans-serif',
color="#000")
)
fig.show(renderer="kaggle")
fig_4 = go.Figure(fig) # to show the same fig in the Results section
seasons = list(labeled_MEvents['Season'].unique())
df = labeled_double_MRegularSeasonDetailedResults[labeled_double_MRegularSeasonDetailedResults.Season.isin(seasons)]
cinderella_vs_ordinary(df, "played", "regular season of 2015-2019", "FTA")
cinderella_vs_ordinary(df, "played", "regular season of 2015-2019", "FTM")
df = labeled_double_MNCAATourneyDetailedResults[labeled_double_MNCAATourneyDetailedResults.Season.isin(seasons)]
cinderella_vs_ordinary(df, "played", "tournaments of 2015-2019", "FTA")
cinderella_vs_ordinary(df, "played", "tournaments of 2015-2019", "FTM")
Data Section 4 file: MMasseyOrdinals.csv - this file lists out rankings (e.g. 1, 2, 3, ..., N) of teams going back to the 2002-2003 season, under a large number of different ranking system methodologies. By convention, the final pre-tournament rankings are always expressed as RankingDayNum=133, even though sometimes the rankings for individual systems are not released until Tuesday (DayNum=134) or even Wednesday or Thursday [1].
MMasseyOrdinals = None
MMasseyOrdinals = load_file(MMasseyOrdinals, 'MMasseyOrdinals')
Filter out tournament teams
Only include rows where both the season and the team ID appear in the tourney data:
MMasseyOrdinals = MMasseyOrdinals[(MMasseyOrdinals['Season'].isin(MNCAATourneyCompactResults['Season']) &
(MMasseyOrdinals['TeamID'].isin(MNCAATourneyCompactResults['WTeamID'])))]
MMasseyOrdinals
Add the labels - Ordinary, Cinderella and Top
cols = ['Season', 'TeamID']
labeled_MMasseyOrdinals = MMasseyOrdinals.join(cinderellas.set_index(cols), on=cols)
labeled_MMasseyOrdinals = labeled_MMasseyOrdinals.join(top_seeded.set_index(cols), on=cols)
# Create a categorical LABEL column:
label = labeled_MMasseyOrdinals[['Cinderella', 'Top']]
label = pd.DataFrame(label.idxmax(axis=1))
labeled_MMasseyOrdinals['LABEL'] = label
# Fill in the missing values:
labeled_MMasseyOrdinals['LABEL'] = labeled_MMasseyOrdinals['LABEL'].fillna("Ordinary")
# Fill in the missing values:
labeled_MMasseyOrdinals['Cinderella'] = labeled_MMasseyOrdinals['Cinderella'].fillna(0) # not a cinderella
labeled_MMasseyOrdinals['Top'] = labeled_MMasseyOrdinals['Top'].fillna(0) # not a top
assert len(labeled_MMasseyOrdinals) == len(MMasseyOrdinals)
labeled_MMasseyOrdinals
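The labeling trick above (pick the name of the non-missing indicator column with `idxmax`, then fill the remaining rows) can be shown on a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Two indicator columns; NaN means "neither category" (toy values).
df = pd.DataFrame({'Cinderella': [1.0, np.nan, np.nan],
                   'Top':        [np.nan, 1.0, np.nan]})
# idxmax(axis=1) returns the column name of the non-missing indicator per row;
# rows with no indicator at all come back as NaN and are filled with "Ordinary".
df['LABEL'] = df[['Cinderella', 'Top']].idxmax(axis=1)
df['LABEL'] = df['LABEL'].fillna('Ordinary')
print(list(df['LABEL']))
```

This yields one categorical column from two sparse indicator columns, which is exactly the shape the plotting code expects.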
How many rows per category?
labeled_MMasseyOrdinals.LABEL.value_counts()
df = labeled_MMasseyOrdinals
# Make plot:
sns.barplot(x=df['RankingDayNum'], y=df['OrdinalRank'], hue=df['LABEL'],
            hue_order=order, dodge=False, errwidth=1.5, alpha=0.75)
plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1.075), ncol=3, fancybox=True)
plt.xlabel("Day no. (of regular season)")
plt.ylabel("Mean overall ranking")
# Fewer x ticks:
for tick_label in plt.gca().xaxis.get_ticklabels()[::2]:
    tick_label.set_visible(False)
plt.xticks(rotation=90)
plt.show()
Some bars look odd; let's check why:
print(labeled_MMasseyOrdinals[labeled_MMasseyOrdinals.RankingDayNum == 20].LABEL.value_counts())
print("\n")
print(labeled_MMasseyOrdinals[labeled_MMasseyOrdinals.RankingDayNum == 20].SystemName.value_counts())
print("\n")
print(labeled_MMasseyOrdinals[labeled_MMasseyOrdinals.RankingDayNum == 20].OrdinalRank.describe())
print(labeled_MMasseyOrdinals[labeled_MMasseyOrdinals.RankingDayNum == 96].LABEL.value_counts())
print("\n")
print(labeled_MMasseyOrdinals[labeled_MMasseyOrdinals.RankingDayNum == 96].SystemName.value_counts())
print("\n")
print(labeled_MMasseyOrdinals[labeled_MMasseyOrdinals.RankingDayNum == 96].OrdinalRank.describe())
labeled_MMasseyOrdinals.SystemName.nunique()
def ranking_comparison(df, lower, upper):
    '''Compare the share of Cinderella team rankings between `lower` and `upper`
    vs. the other team categories.'''
    df_top = df[df.LABEL == 'Top']
    df_cinderella = df[df.LABEL == 'Cinderella']
    df_ordinary = df[df.LABEL == 'Ordinary']
    total_top_games = len(df_top)
    total_cinderella_games = len(df_cinderella)
    total_ordinary_games = len(df_ordinary)
    between_medians_top = len(df_top[(df_top.OrdinalRank > lower) &
                                     (df_top.OrdinalRank < upper)])
    between_medians_cinderella = len(df_cinderella[(df_cinderella.OrdinalRank > lower) &
                                                   (df_cinderella.OrdinalRank < upper)])
    between_medians_ordinary = len(df_ordinary[(df_ordinary.OrdinalRank > lower) &
                                               (df_ordinary.OrdinalRank < upper)])
    share = between_medians_cinderella / total_cinderella_games
    share_top = between_medians_top / total_top_games
    share_ordinary = between_medians_ordinary / total_ordinary_games
    share_str = '{:.0%}'.format(share)
    share_top_str = '{:.0%}'.format(share_top)
    share_ordinary_str = '{:.0%}'.format(share_ordinary)
    # Return unconditionally so the callers can always unpack three values:
    return share_str, share_top_str, share_ordinary_str
Final figure to include in the Results:
df = labeled_MMasseyOrdinals[~labeled_MMasseyOrdinals.SystemName.isin(["DES", "BIH"])]
print(f'{df.Season.min()}-{df.Season.max()}')
system_cnt = df.SystemName.nunique()
df = df.groupby(['LABEL', 'RankingDayNum'], as_index=False).mean()
i=0
for label in order:
    plot_df = df[df.LABEL == label].sort_values(by='RankingDayNum')
    plt.plot(plot_df['RankingDayNum'], plot_df['OrdinalRank'])
    plt.fill_between(plot_df['RankingDayNum'], plot_df['OrdinalRank'], color=label_colors[i],
                     alpha=0.25, label=label)
    i += 1
plt.ylim(0, 200)
plt.gca().margins(0)
plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1.075), ncol=3, fancybox=True)
plt.xticks(np.arange(1, 134, 3))
plt.xticks(rotation=90)
plt.xlabel("Day no. (of regular season)")
plt.ylabel("BEST" + " "*30 + "Mean overall ranking" + " "*30 + "WORST")
plt.title("Figure 13. Team category vs. pre-tournament ranking\nacross {} ranking systems, 2003-2019.".format(system_cnt), y=1.1)
save_plot()
plt.show()
print("Descriptive statistics for file nr. {}:\n".format(str(file_nr-1)))
for label in order:
    print("{}: median rank: {}, mean: {}.".format(label, int(df[df.LABEL == label]['OrdinalRank'].median()),
                                                  round(df[df.LABEL == label]['OrdinalRank'].mean(), 2)))
share_str, share_top_str, share_ordinary_str = ranking_comparison(labeled_MMasseyOrdinals, 20, 80)
print(f'\nIn {share_str} public rankings (of 172 ranking systems) in 2003-2019,'
f' Cinderella teams were ranked between 20'
f' and 80 vs. {share_top_str} for the Top and {share_ordinary_str} for Ordinary teams.')
Narrow down to 5 popular rating systems - Pomeroy (POM), Sagarin (SAG), RPI (RPI), ESPN BPI (EBP) and ESPN SOR (ESR)
The description of each system will be included in the Results section.
labeled_MMasseyOrdinals_five = labeled_MMasseyOrdinals[labeled_MMasseyOrdinals['SystemName'].isin(['POM', 'SAG', 'RPI', 'EBP', 'ESR'])]
labeled_MMasseyOrdinals_five.SystemName.value_counts()
subplot_titles=['Pomeroy', 'RPI', 'Sagarin', 'ESPN BPI', 'ESPN SOR']
fig = make_subplots(rows=5, cols=1,
shared_xaxes=True, subplot_titles=subplot_titles, vertical_spacing = 0.05)
row = 1 # row nr. for subplot
for system_name in ['POM', 'SAG', 'RPI', 'EBP', 'ESR']:  # make plots for each rating system
    i = 0
    for label in order:  # 'Ordinary', 'Cinderella', 'Top'
        df = labeled_MMasseyOrdinals_five[labeled_MMasseyOrdinals_five.LABEL == label]
        print(f'{df.Season.min()}-{df.Season.max()}' + " "*100, end="\r", flush=True)
        df = df[df.SystemName == system_name]
        fig.add_trace(
            go.Box(x=df['OrdinalRank'],
                   name=label,
                   marker_color=sns.color_palette("colorblind").as_hex()[i],
                   boxmean=True,  # represent mean
                   boxpoints="suspectedoutliers",
                   visible=True),
            row=row, col=1)
        i += 1
    row += 1  # go to next subplot
fig.update_layout(showlegend=False, # hide legend
width=plotly_width, height=800) # set size
# Set axis font:
fig.update_yaxes(tickfont=dict(size=14))
# Add titles:
fig.update_xaxes(title_text='Overall ranking from best to worst', row=5, col=1)
# Plot title:
fig.update_layout(
title={
'text': "Team category vs. pre-tournament ranking distribution,<br>2003-2019. Interactive graph.",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
family='sans-serif',
color="#000"),
margin=dict(t=120), # margin between title and plot
boxgap=0.35,
boxgroupgap=0
)
fig.show(renderer="kaggle")
fig_11 = go.Figure(fig) # to show the same fig in the Results section
share_str, share_top_str, share_ordinary_str = ranking_comparison(labeled_MMasseyOrdinals_five, 20, 80)
print(f'In {share_str} public rankings of Pomeroy, RPI, Sagarin, ESPN BPI and ESPN SOR in 2003-2019,'
' Cinderella teams were ranked between 20'
f' and 80 vs. {share_top_str} for the Top and {share_ordinary_str} for Ordinary teams.')
Keep only the final pre-tournament rankings, released before the play-in games:
df = labeled_MMasseyOrdinals_five[labeled_MMasseyOrdinals_five['RankingDayNum'] == 133]
print(f'{df.Season.min()}-{df.Season.max()}')
# plt.figure(figsize=(14,14))
sns.lineplot(x="Season", y="OrdinalRank", data=df, hue='LABEL',
style='SystemName', hue_order=order, ci=None).set(xlim=(2003, 2019))
ax = plt.gca()
legend = ax.legend()
legend.texts[0].set_text("Team category")
legend.texts[4].set_text("\nRanking system")
ax.invert_yaxis() # to show best rating on top
plt.xlabel("Season")
plt.ylabel("WORST" + " "*32 + "Overall ranking" + " "*32 + "BEST")
plt.title("Figure 14. Final pre-tournament rankings by season vs. team category\n")
save_plot()
plt.show()
share_str, share_top_str, share_ordinary_str = ranking_comparison(labeled_MMasseyOrdinals_five, 20, 65)
print(f'In {share_str} of final pre-tournament rankings in 2003-2019,'
' Cinderella teams were ranked between 20'
f' and 65 vs. {share_top_str} for the Top and {share_ordinary_str} for Ordinary teams.')
df[df.LABEL == 'Cinderella'].groupby(['Season', 'SystemName'])['OrdinalRank'].describe()
Note how the 2013 season's rankings were the most spread out. In other words, this season has the biggest standard deviation of pre-tournament rankings (EBP: 52.16, POM: 44.43, RPI: 26.85, SAG: 43.02) among its 3 Cinderella teams - FL Gulf Coast, La Salle and Oregon.
Considering the above analysis, we will try to predict which team could have become a Cinderella had the 2020 tournament not been canceled.
We believe that rankings are important for "Cinderellaness", so we will build our input data on the available rankings. This also means that we will not use any data from before the 2003 season.
MMasseyOrdinals = load_file(MMasseyOrdinals, 'MMasseyOrdinals')
MMasseyOrdinals.SystemName.value_counts().head()
It is important to have as many data samples as possible, so we will use the 3 ranking systems that occur most frequently in the data:
ml_MMasseyOrdinals = MMasseyOrdinals[MMasseyOrdinals['SystemName'].isin(['SAG', 'MOR', 'POM'])]
ml_MMasseyOrdinals
ml_MMasseyOrdinals.SystemName.value_counts()
Calculate the mean ranking per system
ml_MMasseyOrdinals = ml_MMasseyOrdinals.groupby(['Season', 'TeamID', 'SystemName'], as_index=False).mean()
ml_MMasseyOrdinals = ml_MMasseyOrdinals.drop(columns='RankingDayNum')
ml_MMasseyOrdinals
One-hot encode the mean rankings:
dummies = pd.get_dummies(ml_MMasseyOrdinals['SystemName'])
for col in dummies:
    ml_MMasseyOrdinals[col] = dummies[col] * ml_MMasseyOrdinals['OrdinalRank']
ml_MMasseyOrdinals = ml_MMasseyOrdinals.groupby(['Season', 'TeamID'], as_index=False).sum()
ml_MMasseyOrdinals = ml_MMasseyOrdinals.drop(columns='OrdinalRank')
ml_MMasseyOrdinals
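The dummy-multiply-and-sum step above pivots the long ranking table into one column per system. A toy version with hypothetical teams and ranks makes the reshaping visible:

```python
import pandas as pd

# Long format: one row per (Season, TeamID, SystemName) mean ranking (toy values).
long_df = pd.DataFrame({'Season': [2019, 2019, 2019, 2019],
                        'TeamID': [1101, 1101, 1102, 1102],
                        'SystemName': ['POM', 'SAG', 'POM', 'SAG'],
                        'OrdinalRank': [12.0, 15.0, 40.0, 38.0]})
# One-hot encode the system and scale each dummy column by the rank...
dummies = pd.get_dummies(long_df['SystemName'])
for col in dummies:
    long_df[col] = dummies[col] * long_df['OrdinalRank']
# ...then sum per team, so each system ends up in its own column (wide format):
wide = (long_df.drop(columns='SystemName')
               .groupby(['Season', 'TeamID'], as_index=False).sum()
               .drop(columns='OrdinalRank'))
print(wide)
```

The same reshaping could also be done with `pivot_table`; the dummy-multiply version just mirrors the approach used in this notebook.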
Select the same seasons as we have in the rankings:
seasons = list(ml_MMasseyOrdinals.Season.unique())
ml_double_MRegularSeason = double_MRegularSeasonDetailedResults[double_MRegularSeasonDetailedResults.Season.isin(seasons)]
ml_double_MRegularSeason
Calculate mean metrics:
ml_double_MRegularSeason = ml_double_MRegularSeason.groupby(['Season', 'TeamID'], as_index=False).mean()
Add output label: Cinderella
cols = ['Season', 'TeamID']
ml_double_MRegularSeason = ml_double_MRegularSeason.join(cinderellas.set_index(cols), on=cols)
# Fill in the missing values:
ml_double_MRegularSeason['Cinderella'] = ml_double_MRegularSeason['Cinderella'].fillna(0)
ml_double_MRegularSeason
ml_double_MRegularSeason.Cinderella.value_counts()
We will manually select the features that we believe contribute most to "Cinderellaness".
We have already chosen 3 ranking systems and calculated mean rankings. Now we will select which columns to keep from the regular season data. Essentially, we want to remove attempt metrics such as field goals attempted (including two-point and three-point field goals), the opponent rebound columns, and irrelevant columns such as the day number and the winning / losing team IDs.
ml_double_MRegularSeason = ml_double_MRegularSeason.drop(columns=['FGA', 'FGA2', 'FGA3',
                                                                  'DayNum', 'WTeamID', 'LTeamID',
                                                                  'OppOR', 'OppDR'])
ml_double_MRegularSeason
cols = ['Season', 'TeamID']
ml_data = ml_double_MRegularSeason.join(ml_MMasseyOrdinals.set_index(cols),
on=cols,
how='inner').reset_index(drop=True)
ml_data
print(list(ml_data))
ml_data_2020 = ml_data[ml_data.Season == 2020]
ml_data_2020 = ml_data_2020.drop(columns='Cinderella')
ml_data_2020
ml_data = ml_data[ml_data.Season != 2020]
ml_data.Cinderella.value_counts()
Note that our data is imbalanced - only 34 Cinderella cases vs. 5799 non-Cinderella cases. This is a potential problem for a classification model; we will address it later.
Prepare the X (input) and y (output) data for machine learning:
X = ml_data.loc[:, ml_data.columns != 'Cinderella']
y = ml_data[['Cinderella']]
print(X.shape, y.shape)
We will leave 25% of the data as a test set that our model will not use for training.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
y_train.Cinderella.value_counts()
y_test.Cinderella.value_counts()
Considering that our data is imbalanced, we will use a DummyClassifier model as a baseline. This model simply follows the 'most_frequent' strategy and always predicts the most frequent label (non-Cinderella).
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
clf_dummy = DummyClassifier(strategy='most_frequent', random_state=0).fit(X_train, y_train)
y_pred = clf_dummy.predict(X_test)
print("Dummy model accuracy (most frequent label): %0.2f" % (accuracy_score(y_test, y_pred)))
Note that the dummy model scored 99% accuracy, so in our case plain accuracy is not a good metric for evaluating the real model. We will instead use metrics that are more suitable for imbalanced datasets: balanced accuracy, F1 score and ROC AUC.
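Why accuracy misleads here can be shown with a hand-made imbalanced example (99 negatives, 1 positive, entirely made-up labels): a classifier that always predicts the majority class looks excellent on accuracy but scores only chance level on balanced accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# 99 negatives, 1 positive; the "model" always predicts the majority class 0.
y_true = np.array([0] * 99 + [1])
y_pred = np.zeros(100, dtype=int)

print("Accuracy:          %.2f" % accuracy_score(y_true, y_pred))           # 0.99
print("Balanced accuracy: %.2f" % balanced_accuracy_score(y_true, y_pred))  # 0.50
print("Macro F1:          %.2f" % f1_score(y_true, y_pred, average='macro'))
```

Balanced accuracy averages the per-class recalls (1.0 for the majority class, 0.0 for the minority), so a useless majority-class predictor lands at 0.50 - exactly what our dummy baseline should score.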
from sklearn.model_selection import cross_validate
scoring = ['balanced_accuracy', 'f1_macro', 'roc_auc']
scores = cross_validate(clf_dummy, X_train, y_train, cv=5, scoring=scoring)
sorted(scores.keys())
scores['test_roc_auc']
def print_scores(scores):
'''Print out classification metrics'''
print("Balanced accuracy: %0.2f (+/- %0.2f)" % (scores['test_balanced_accuracy'].mean(), scores['test_balanced_accuracy'].std() * 2))
print("F1 score: %0.2f (+/- %0.2f)" % (scores['test_f1_macro'].mean(), scores['test_f1_macro'].std() * 2))
print("ROC AUC: %0.2f (+/- %0.2f)" % (scores['test_roc_auc'].mean(), scores['test_roc_auc'].std() * 2))
print("Baseline model scores:\n")
print_scores(scores)
Now that we have a dummy baseline model, we will try out different models and check which one gives better results. Our goal is to train a model that achieves at least a 0.60 F1 score and a 0.70 ROC AUC on test data it has never seen.
Considering historical data, we expect that there could have been 0 to 5 Cinderella teams in 2020 (had the tournament not been canceled).
In fact, 1 to 3 Cinderella teams per season is the most likely range:
# Cinderellas per season:
season_team_cinderellas.groupby('Season').count()['Cinderella'].describe()
We will train several models and evaluate the results. We will use cross-validation instead of a single train / test split, because our dataset is small and we want to maximize the number of samples available for learning. We will leave the test dataset untouched for now and use only the training dataset for both model fitting and cross-validation.
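For classifiers, scikit-learn's `cross_validate` stratifies its folds by default, which matters with a rare positive class: each fold keeps roughly the same class ratio. A small sketch with made-up labels (95 negatives, 5 positives) shows the effect:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 95 negatives, 5 positives: stratification places one positive in each of 5 folds.
y = np.array([0] * 95 + [1] * 5)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"Fold {fold}: positives in validation set = {y[val_idx].sum()}")
```

Without stratification, a random split could easily leave some folds with no Cinderella cases at all, making the per-fold scores meaningless.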
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection [27].
from sklearn import svm
clf = svm.SVC(kernel='rbf')
# Evaluating estimator performance:
scores = cross_validate(clf, X_train, y_train, cv=5, scoring=scoring)
print_scores(scores)
clf.fit(X_train, y_train) # training the model
pred = clf.predict(ml_data_2020) # predicting an output
print(f'\nCinderellas in 2020: {np.count_nonzero(pred == 1)}')
clf = svm.SVC(kernel='rbf', class_weight='balanced')
# Evaluating estimator performance:
scores = cross_validate(clf, X_train, y_train, cv=5, scoring=scoring)
print_scores(scores)
clf.fit(X_train, y_train) # training the model
pred = clf.predict(ml_data_2020) # predicting an output
print(f'\nCinderellas in 2020: {np.count_nonzero(pred == 1)}')
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
# Evaluating estimator performance:
scores = cross_validate(clf, X_train, y_train, cv=5, scoring=scoring)
print_scores(scores)
clf.fit(X_train, y_train) # training the model
pred = clf.predict(ml_data_2020) # predicting an output
print(f'\nCinderellas in 2020: {np.count_nonzero(pred == 1)}')
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way [29].
from xgboost import XGBClassifier
clf = XGBClassifier()
# Evaluating estimator performance:
scores = cross_validate(clf, X_train, y_train, cv=5, scoring=scoring)
print_scores(scores)
clf.fit(X_train, y_train) # training the model
pred = clf.predict(ml_data_2020) # predicting an output
print(f'\nCinderellas in 2020: {np.count_nonzero(pred == 1)}')
Out of the 4 models we have trained, only one (Support Vector Classification, balanced) had a balanced accuracy (0.83) greater than our dummy baseline (0.50). Unfortunately, the same model had the lowest F1 score (0.43) of all four and predicted 97 teams to become Cinderellas.
A major limitation is the imbalanced data. To address this issue, we will perform over-sampling using SMOTE - the Synthetic Minority Over-sampling Technique.
Perform over-sampling using SMOTE:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train, y_train)
y_train_resampled.Cinderella.value_counts()
After over-sampling, our new data has 4348 Cinderella cases and 4348 non-Cinderella cases.
Note that we will not use the "balanced" model now, because over-sampling has already balanced the data.
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
clf = svm.SVC(kernel='rbf')
# Evaluating estimator performance:
scores = cross_validate(clf, X_train_resampled, y_train_resampled, cv=5, scoring=scoring)
print_scores(scores)
clf.fit(X_train_resampled, y_train_resampled) # training the model
y_pred = clf.predict(X_test) # using the model on a test set
pred = clf.predict(ml_data_2020) # predicting an output
# Evaluating performance on a new data:
print("\nF1 score (test data): %0.2f" % (f1_score(y_test, y_pred, average='macro')))
print("ROC AUC (test data): %0.2f" % (roc_auc_score(y_test, y_pred)))
print(f'\nCinderellas in 2020: {np.count_nonzero(pred == 1)}')
clf = RandomForestClassifier()
# Evaluating estimator performance:
scores = cross_validate(clf, X_train_resampled, y_train_resampled, cv=5, scoring=scoring)
print_scores(scores)
clf.fit(X_train_resampled, y_train_resampled) # training the model
y_pred = clf.predict(X_test) # using the model on a test set
pred = clf.predict(ml_data_2020) # predicting an output
# Evaluating performance on a new data:
print("\nF1 score (test data): %0.2f" % (f1_score(y_test, y_pred, average='macro')))
print("ROC AUC (test data): %0.2f" % (roc_auc_score(y_test, y_pred)))
print(f'\nCinderellas in 2020: {np.count_nonzero(pred == 1)}')
clf = XGBClassifier()
# Evaluating estimator performance:
scores = cross_validate(clf, X_train_resampled, y_train_resampled, cv=5, scoring=scoring)
print_scores(scores)
clf.fit(X_train_resampled, y_train_resampled) # training the model
y_pred_XGB = clf.predict(X_test) # using the model on a test set
pred = clf.predict(ml_data_2020) # predicting an output
# Evaluating performance on a new data:
print("\nF1 score (test data): %0.2f" % (f1_score(y_test, y_pred_XGB, average='macro')))
print("ROC AUC (test data): %0.2f" % (roc_auc_score(y_test, y_pred_XGB)))
print(f'\nCinderellas in 2020: {np.count_nonzero(pred == 1)}')
After over-sampling, all three models improved on the resampled training data.
The SVC model achieved an average balanced accuracy of 0.85, an F1 score of 0.85 and a ROC AUC of 0.86, but it didn't perform well on the test data, with a low F1 score of 0.45.
The Random Forest model had a balanced accuracy and F1 score of 0.99 and a ROC AUC of 1.00 in cross-validation. On the test data, it showed an F1 score of 0.56 and a ROC AUC of 0.62.
The XGBoost model performed best: it showed the same cross-validation metrics as the Random Forest model, but better results on the test data - a 0.61 F1 score and a 0.68 ROC AUC. (Note that the near-perfect cross-validation scores are optimistic: SMOTE was applied before the folds were split, so synthetic points derived from one fold can appear in another; the test-set scores are the more trustworthy figures.)
clf.get_params() # current model parameters
'eta' - Step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
scores = []
for eta in np.arange(0, 1, 0.15):
    clf = XGBClassifier(eta=eta).fit(X_train_resampled, y_train_resampled)
    y_pred = clf.predict(X_test)  # using the model on a test set
    score = f1_score(y_test, y_pred, average='macro')
    scores.append(score)
    print("eta: {:.2f} / ".format(eta) + "F1 score (test data): %0.2f" % (score))
plt.figure(figsize = (7.5,3))
plt.xlabel('eta')
plt.ylabel('F1 score')
plt.plot(np.arange(0, 1, 0.15), scores)
'max_depth' - Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 is only accepted in lossguided growing policy when tree_method is set as hist and it indicates no limit on depth.
scores = []
for max_depth in range(1, 10):
    clf = XGBClassifier(max_depth=max_depth).fit(X_train_resampled, y_train_resampled)
    y_pred = clf.predict(X_test)  # using the model on a test set
    score = f1_score(y_test, y_pred, average='macro')
    scores.append(score)
    print("max_depth: {} / ".format(max_depth) + "F1 score (test data): %0.2f" % (score))
plt.figure(figsize = (7.5,3))
plt.xlabel('max_depth')
plt.ylabel('F1 score')
plt.plot(range(1, 10), scores)
'subsample' - Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees, and this will prevent overfitting. Subsampling will occur once in every boosting iteration.
scores = []
for subsample in np.arange(0.1, 1.01, 0.10):  # subsample must be in (0, 1], so start above 0
    clf = XGBClassifier(subsample=subsample).fit(X_train_resampled, y_train_resampled)
    y_pred = clf.predict(X_test)  # using the model on a test set
    score = f1_score(y_test, y_pred, average='macro')
    scores.append(score)
    print("subsample: {:.2f} / ".format(subsample) + "F1 score (test data): %0.2f" % (score))
plt.figure(figsize = (7.5,3))
plt.xlabel('subsample')
plt.ylabel('F1 score')
plt.plot(np.arange(0.1, 1.01, 0.10), scores)
'n_estimators' – Number of gradient boosted trees. Equivalent to number of boosting rounds.
scores = []
for n_estimators in [50, 100, 200, 250, 500]:
    clf = XGBClassifier(n_estimators=n_estimators).fit(X_train_resampled, y_train_resampled)
    y_pred = clf.predict(X_test)  # using the model on a test set
    score = f1_score(y_test, y_pred, average='macro')
    scores.append(score)
    print("n_estimators: {} / ".format(n_estimators) + "F1 score (test data): %0.2f" % (score))
plt.figure(figsize = (7.5,3))
plt.xlabel('n_estimators')
plt.ylabel('F1 score')
plt.plot([50, 100, 200, 250, 500], scores)
Train the best model and make final predictions:
model = XGBClassifier(subsample = 0.2, n_estimators = 200)
model.fit(X_train_resampled, y_train_resampled) # training the model
y_pred = model.predict(X_test) # using the model on a test set
# Evaluating performance on a new data:
print("ROC AUC (test data): %0.2f\n" % (roc_auc_score(y_test, y_pred)))
pred = model.predict(ml_data_2020) # predicting an output
print(f'Cinderellas in 2020: {np.count_nonzero(pred == 1)}')
pred_proba = model.predict_proba(ml_data_2020) # also get probabilities
Compare the updated model with the previous one:
from sklearn.metrics import classification_report
print("\nXGBoost:")
print(classification_report(y_test, y_pred_XGB))
print("\nXGBoost (tuned parameters):")
print(classification_report(y_test, y_pred))
from sklearn.metrics import confusion_matrix
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
cm = confusion_matrix(y_test, y_pred_XGB)
cm_labels = ['Other', 'Cinderella']
sns.heatmap(cm,
            cmap=sns.cubehelix_palette(),
            cbar=False,
            annot=True, annot_kws={"size": 13.5}, fmt='g',
            xticklabels=cm_labels,
            yticklabels=cm_labels, ax=ax[0])
ax[0].set_title("XGBoost\nF1 score: 0.61\n")
ax[0].set_xlabel("\nPredicted label")
ax[0].set_ylabel("True label\n")
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm,
            cmap=sns.cubehelix_palette(),
            cbar=False,
            annot=True, annot_kws={"size": 13.5}, fmt='g',
            xticklabels=cm_labels,
            yticklabels=cm_labels, ax=ax[1])
ax[1].set_title("XGBoost (tuned parameters)\nF1 score: 0.63\n")
ax[1].set_xlabel("\nPredicted label")
ax[1].set_ylabel("True label\n")
plt.subplots_adjust(wspace=0.35)
plt.show()
# Determine the way floating point numbers, arrays and other NumPy objects are displayed:
np.set_printoptions(formatter={'float_kind':'{:f}'.format})
# Each probability will be formatted like so:
pred_proba[0]
Find out which teams are predicted to become Cinderellas:
ml_data_2020['Cinderella'] = pred
ml_data_2020['Probability'] = pred_proba[:, 1]  # probability of the Cinderella class
pred_cinderellas = ml_data_2020[ml_data_2020.Cinderella == 1][['Season', 'TeamID', 'Probability']]
pred_cinderellas
Team names:
MTeams[MTeams.TeamID.isin(pred_cinderellas.TeamID)]
Seed numbers from previous seasons, as an additional check:
for team in pred_cinderellas.TeamID:
    print(f'Team: {MTeams.loc[MTeams.TeamID == team, "TeamName"].values[0]}')
    print(MNCAATourneySeeds[MNCAATourneySeeds.TeamID == team].sort_values('Season', ascending=False).head())
    print('\n')
Based on our experiment, six teams were potential candidates to become Cinderellas: Arizona St, ETSU, Florida, Illinois, Indiana and Providence.
# Show saved images in the "Results" section:
output_dir = '/kaggle/working/'
if sys.executable != '/opt/conda/bin/python':
    # remove the leading forward slash if running this notebook locally:
    output_dir = output_dir[1:]

def display_img(filename):
    if os.path.isfile(output_dir + filename):
        display(Image(output_dir + filename))
    else:
        print("Image not found. Re-run this cell when the Implementation section is executed!")

fig_error = "Graph not found. Re-run this cell when the Implementation section is executed!"

def display_fig(fig):
    fig.show(renderer="kaggle")
This section summarises general findings about men's NCAA® basketball across different seasons from 1985 to 2020.
All calculations for the numbers mentioned in our findings are available in the Implementation section.
display_img("07.png")
try:
    display_fig(fig_1)
except NameError:
    print(fig_error)
display_img("09.png")
Based on 2015-2019 data:
The highest 2-point shot accuracy is achieved in the basket area. Two-point accuracy is lowest at about 3 meters from the basket, with an interesting peak at about 5 meters where accuracy improves slightly.
For the best 3-point shot accuracy, a player should shoot from just behind the three-point line. Shots made from about 9.5 meters are more accurate than those made from 8 meters.
display_img("08.png")
For the data available with court coordinates (2019-2020), about 69% of the three-point goals were made from either the "outside right" or "outside left" areas, and only 21% were made from the "outside center" area.
Could it be because of a possibly stronger defense in the center area? While our data lacks X, Y coordinates for defensive events like rebounds, blocks and steals, we can see in the second figure that 41% of the turnovers beyond the three-point line happened right there in the center. There are many actions that can result in a turnover, including: ball stolen by the opposing team, throwing a bad pass, throwing the ball out of bounds, stepping out of bounds, committing a double-dribble, a palming or traveling violation, a backcourt violation, a shot clock violation, a three-second violation, a five-second violation or an offensive foul (charge or illegal screen) [5].
In addition, we assume that if more data were available, the goals would be distributed more evenly along the three-point line.
display_img("01.png")
Note. Each point in a scatter plot represents an observation in the dataset. In this figure we are looking at the relationship between game location (for the winning team) and points scored. We see that dark points (opponent-floor games) are lower and tend slightly towards the right side (fewer points for the visitor, more points for the home team), while pink points (home games) are higher and tend slightly towards the left (again, more points for the home team, fewer points for the visitor).
Based on 1985-2020 data:
display_img("02.png")
Note. In this graph we plot mean scoring margin along with a 95% confidence interval for that mean (a range of values that we can be 95% certain contains the mean).
Based on 1985-2020 data:
display_img("11.png")
Note. To calculate Offensive Efficiency we used the following formula: (FGM + A) / (FGA - OREB + A + TO) [7]
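As a quick sanity check, the formula in the note can be sketched as a small helper; the parameter names below are illustrative stand-ins, not the competition's exact column names:

```python
# Offensive Efficiency per the note's formula: OE = (FGM + A) / (FGA - OREB + A + TO)
# Parameter names are illustrative; the actual dataset columns may differ.
def offensive_efficiency(fgm, ast, fga, oreb, to):
    """Share of a player's 'used' possessions that end in a made field goal or an assist."""
    denominator = fga - oreb + ast + to
    return (fgm + ast) / denominator if denominator else 0.0

# Example: 8 made field goals, 4 assists, 15 attempts, 2 offensive rebounds, 3 turnovers
# -> (8 + 4) / (15 - 2 + 4 + 3) = 12 / 20 = 0.6
oe = offensive_efficiency(8, 4, 15, 2, 3)
```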
Based on 2015-2019 data, the top 5 offensive players in five years, using the Offensive Efficiency metric by season, would be:
In this section we demonstrate our findings about the key features that define a Cinderella team compared to other team categories.
display_img("03.png")
try:
    display_fig(fig_2)
except NameError:
    print(fig_error)
Based on 1985-2019 data:
display_img("04.png")
Note. A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be outliers [13].
Based on 1985-2019 data:
display_img("05.png")
Based on 1985-2019 data:
try:
    display_fig(fig_3)
except NameError:
    print(fig_error)
Note. In this graph and all similar graphs, use the dropdown menu and / or buttons to switch between different states.
Based on 2003-2019 data:
try:
    display_fig(fig_4)
except NameError:
    print(fig_error)
Based on 2015-2019 data:
Cinderella teams have the biggest share of missed free throws (32% missed vs. 68% made) among all team categories.
Cinderellas have missed the "first of two" free throw attempts in 37% of attempts and have made a successful "first of two" shot in 42% of shots.
In 56% of games played in tournaments of 2015-2019, Cinderella teams had less than 13 free throws made (mean: 11.78, median: 12) vs. 49% of games for the Ordinary teams (mean: 12.89, median: 13).
Despite imbalanced sample sizes (70524 free throw events for Ordinary, 650 for Cinderella and 7522 for Top teams), the structure of free throw attempts looks very similar for all three categories.
try:
    display_fig(fig_5)
except NameError:
    print(fig_error)
Based on 2003-2019 data:
try:
    display_fig(fig_6)
except NameError:
    print(fig_error)
try:
    display_fig(fig_7)
except NameError:
    print(fig_error)
Based on 2003-2019 data:
try:
    display_fig(fig_8)
except NameError:
    print(fig_error)
Based on 2015-2019 data:
Cinderellas might be able to defend without fouling - they had a better blocks-to-personal-fouls ratio than Ordinary teams in both the regular season (21.4% vs. 17.8%) and NCAA® tournaments (21.2% vs. 16.7%).
In 59% of games played in tournaments, Cinderella teams had less than 18 personal fouls (mean: 16.72, median: 16.0) vs. 45% of games for the Ordinary teams (mean: 18.21, median: 18.0).
try:
    display_fig(fig_9)
except NameError:
    print(fig_error)
Based on 2015-2019 data:
The foul share of total "foul plus turnover" events for Cinderellas (53%) is greater than that of Ordinary teams (51%) but less than that of Top teams (55%).
Cinderellas have almost identical turnover structure as Ordinary teams, both having equal share of bad pass turnover (14%), lost ball turnover (13%), offensive turnover (5%) and travelling turnover (4%).
Top teams have smaller personal foul share (45% vs. 48%) and greater offensive foul share (6% vs. 5%) than Cinderella teams.
try:
    display_fig(fig_10)
except NameError:
    print(fig_error)
Note. To calculate Rebound Margin we used the following formula: RPG - OPP RPG [15].
Based on 2003-2019 data:
In 60% of games played in regular season, Cinderella teams had a positive Rebound Margin (mean: 2.44, median: 2.0) vs. 47% of games for the Ordinary teams (mean: -0.17, median: 0.0).
Cinderella teams had a Rebound Margin greater than -2.0 (mean: -0.97, median: 0.0) in 54% of games played in tournaments vs. 49% of games for the Ordinary teams (mean: -1.71, median: -2.0).
If we look at the games won only, Cinderella teams have the lowest Rebound Margin among all team categories (mean: 1.48, median: 2.0), so we should not credit rebounding for a Cinderella's in-game success. We also have to be mindful of the fact that this metric does not take into account defensive vs. offensive rebounds.
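The per-game and season-level Rebound Margin figures above can be sketched with a few lines of pandas; the frame and its column names (Rebounds, OppRebounds) are hypothetical stand-ins for the detailed results data, not the competition's schema:

```python
import pandas as pd

# Hypothetical game-level data: each row is one game for one team.
games = pd.DataFrame({
    "TeamID":      [1101, 1101, 1102, 1102],
    "Rebounds":    [38, 30, 41, 28],
    "OppRebounds": [32, 35, 41, 30],
})

# Per-game margin; averaging it per team gives Rebound Margin = RPG - OPP RPG [15]
games["ReboundMargin"] = games["Rebounds"] - games["OppRebounds"]
rebound_margin = games.groupby("TeamID")["ReboundMargin"].mean()

# Share of games with a positive margin, as reported in the findings above
positive_share = (games["ReboundMargin"] > 0).mean()
```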
display_img("10.png")
Based on 2015-2019 data:
display_img("12.png")
try:
    display_fig(fig_11)
except NameError:
    print(fig_error)
Based on 2003-2019 data:
What is the difference between Pomeroy, RPI, Sagarin, ESPN BPI and ESPN SOR?
Pomeroy - Ken Pomeroy's ranking system that incorporates statistics like shooting percentage, margin of victory, and strength of schedule, ultimately calculating offensive, defensive, and overall "efficiency" numbers for all teams in Division I. Higher-ranked teams are predicted to beat lower-ranked teams on a neutral court [16].
RPI - the Rating Percentage Index (RPI) has been used by the NCAA men's basketball committee since 1981, as supplemental data to help select at-large teams and seed all teams for the men's and women's NCAA basketball tournaments. The three component factors which make up the RPI are as follows: (25%) the team's Division I winning percentage, (50%) team's opponents' Division I winning percentage, (25%) team's opponents' opponents' Division I winning percentage [17].
Sagarin - Jeff Sagarin rankings that aim to do the same thing as the Pomeroy ratings, but use a different formula, one that doesn't (appear to) factor in stats like shooting percentage (though the algorithm is proprietary and, thus, not entirely transparent) [16]. The overall rating is a synthesis of the three different score-based methods: PREDICTOR, GOLDEN_MEAN, and RECENT [18].
ESPN BPI - a predictive rating system for college basketball that's designed to measure team strength and project performance going forward. In the simplest sense, BPI (College Basketball Power Index) is a power rating that can be used to determine how much better one team is than another [19].
ESPN SOR - ESPN's Strength of Record takes strength of schedule a step further by accounting for how a team actually did against its schedule. Unlike BPI, which accounts for how the game was won, Strength of Record simply cares about the difficulty of a team’s schedule and the result (win or loss) [19].
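Of the systems above, only the RPI has a public closed-form definition, so its weighting can be sketched directly. This is a simplified illustration of the weights cited in [17]; the official NCAA calculation also adjusts winning percentage for home/away games, which is omitted here:

```python
# RPI = 0.25 * WP + 0.50 * OWP + 0.25 * OOWP  (weights per [17])
def rpi(wp, owp, oowp):
    """wp: team's Div I winning %; owp: opponents' winning %;
    oowp: opponents' opponents' winning %. All in [0, 1]."""
    return 0.25 * wp + 0.50 * owp + 0.25 * oowp

# Example: a team winning 75% of its games against a .550 / .500 schedule
# -> 0.25 * 0.75 + 0.50 * 0.55 + 0.25 * 0.50 = 0.5875
value = rpi(0.75, 0.55, 0.50)
```

Note how opponents' winning percentage carries twice the weight of the team's own record, which is why RPI rewards a strong schedule.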
display_img("13.png")
We have trained a machine learning model to predict which teams could have become Cinderellas in the 2020 season had the tournament not been canceled.
Our final model used an XGBoost classifier. It was able to predict whether or not a team would be a Cinderella with 0.98 accuracy on data it had never seen. Considering that the input data was heavily imbalanced (only 34 Cinderella cases vs. 5799 non-Cinderella cases), we used the F1 score and ROC AUC (area under the ROC curve) metrics to evaluate the final results.
We acknowledge that "Cinderellaness" is a tricky feature that is not straightforward to predict, so we were pleased to achieve a macro average F1 score of 0.63 and a ROC AUC of 0.74 on the test dataset.
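The metric choice matters here: on labels this imbalanced, a trivial majority-class predictor already looks excellent by accuracy, which is why macro F1 is the more honest yardstick. A quick illustration, using the class counts quoted above:

```python
from sklearn.metrics import accuracy_score, f1_score

# 34 Cinderella cases vs. 5799 non-Cinderella cases, as in our data
y_true = [1] * 34 + [0] * 5799
y_pred = [0] * (34 + 5799)  # trivial "never a Cinderella" classifier

acc = accuracy_score(y_true, y_pred)      # ~0.99: looks great, says nothing
macro_f1 = f1_score(y_true, y_pred,
                    average='macro',
                    zero_division=0)      # ~0.50: exposes the useless model
```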
According to our results, the top 3 potential Cinderella teams of 2020 could be:
Although we do not have ground truth data to check our predictions, we verified that similar assumptions about ETSU's Cinderella potential were discussed in SPORTS ILLUSTRATED [20], NBC Sports [21] and USA TODAY [22].
Please refer to the Implementation section to see all six Cinderella candidates that our model predicted.
[1] Kaggle. (2020). Google Cloud & NCAA® March Madness Analytics. Data Description. [Online]. Available: https://www.kaggle.com/c/march-madness-analytics-2020/data
[2] NCAA. (2020). NCAA cancels remaining winter and spring championships. [Online]. Available: https://www.ncaa.org/about/resources/media-center/news/ncaa-cancels-remaining-winter-and-spring-championships
[3] J. Boozell. (2019). The 11 greatest March Madness Cinderella stories. [Online]. Available: https://www.ncaa.com/news/basketball-men/2019-02-21/11-greatest-march-madness-cinderella-stories
[4] K. Bonsor and D. Roos. (2003). How March Madness Works. [Online]. Available: https://entertainment.howstuffworks.com/march-madness.htm
[5] Jr NBA. (n.d.). Turnover. [Online]. Available: https://jr.nba.com/turnover/
[6] NCAA. (ca. 2020). Selection Criteria. [Online]. Available: http://www.ncaa.org/about/resources/media-center/mens-basketball-selections-101-selections
[7] Thunder StatLab. (n.d.). OFFENSIVE EFFICIENCY. [Online]. Available: https://www.nba.com/resources/static/team/v2/thunder/statlab-OE-191201.pdf
[8] Texas Tech University. (ca. 2020). NORENSE ODIASE. [Online]. Available: https://texastech.com/sports/mens-basketball/roster/norense-odiase/6580
[9] University Athletic Assoc., Inc., FOX Sports Sun & IMG College. (ca. 2019). GORJOK GAK. [Online]. Available: https://floridagators.com/sports/mens-basketball/roster/gorjok-gak/11067
[10] E. Giambalvo. (2020). Maryland basketball’s Jalen Smith earns third-team all-American honors. [Online]. Available: https://www.washingtonpost.com/sports/2020/03/20/maryland-basketballs-jalen-smith-earns-third-team-all-american-honors/
[11] R. Wilson. (2020). Can 'unbreakable' Tyrique Jones carry Xavier into NCAA Tournament? [Online]. Available: https://www.wcpo.com/sports/college-sports/xavier-university-sports/can-unbreakable-tyrique-jones-carry-xavier-into-ncaa-tournament
[12] Oklahoma State University Athletics. (ca. 2020). MITCHELL SOLOMON. [Online]. Available: https://okstate.com/sports/mens-basketball/roster/mitchell-solomon/4051
[13] Seaborn. (n.d.). seaborn.boxplot. [Online]. Available: https://seaborn.pydata.org/generated/seaborn.boxplot.html
[14] M. Badger. (ca. 2014). Stat Central: Understanding Strengths, Shortcomings Of Assist Rate Metrics. [Online]. Available: https://hoopshabit.com/2013/08/18/stat-central-understanding-strengths-shortcomings-of-assist-rate-metrics/
[15] NCAA. (2020). Men's Basketball. TEAM STATISTICS. REBOUND MARGIN. [Online]. Available: https://www.ncaa.com/stats/basketball-men/d1/current/team/151
[16] S. Paruk. (2020). Which Advanced Metric Should Bettors Use: KenPom or Sagarin? [Online]. Available: https://www.sportsbettingdime.com/guides/strategy/kenpom-vs-sagarin/
[17] Collegiate Basketball News Company. (n.d.). What is the RPI? [Online]. Available: http://rpiratings.com/WhatisRPI.php
[18] J. Sagarin. (2020). Jeff Sagarin's College Basketball Ratings. [Online]. Available: http://sagarin.com/sports/cbsend.htm
[19] ESPN Sports Analytics Team. (2016). BPI and Strength of Record: What are they and how are they derived? [Online]. Available: https://www.espn.com/blog/statsinfo/post/_/id/125994/bpi-and-strength-of-record-what-are-they-and-how-are-they-derived
[20] K. Sweeney. (2020). Cinderella Spotlight: Steve Forbes Has Built a Mid-Major Force at East Tennessee State. [Online]. Available: https://www.si.com/college/2020/03/11/march-madness-cinderellas-etsu-basketball
[21] R. Dauster. (2020). Introducing Cinderella: East Tennessee State doesn’t need an at-large bid anymore. [Online]. Available: https://collegebasketball.nbcsports.com/2020/03/09/introducing-cinderella-east-tennessee-state-doesnt-need-an-at-large-bid-anymore/
[22] S. Gleeson. (2020). Six mid-major teams that had potential to be Cinderella before coronavirus canceled March Madness. [Online]. Available: https://eu.usatoday.com/story/sports/ncaab/2020/03/16/coronavirus-march-madness-ncaa-tournament-cinderella-potential/5012987002/
[23] D. Wilco. (2020). What is March Madness: The NCAA tournament explained. [Online]. Available: https://www.ncaa.com/news/basketball-men/bracketiq/2020-04-20/what-march-madness-ncaa-tournament-explained
[24] City location coordinates obtained via GeoPy Nominatim geocoder for OpenStreetMap data. (The MIT License). [Online]. Available: https://www.kaggle.com/evanca/ncaageocities
[25] Court outline image (Figures 4-6) courtesy of author.
[26] K. Bonsor. (2003). How Basketball Works. Scoring. [Online]. Available: https://entertainment.howstuffworks.com/basketball4.htm
[27] Scikit-learn. (n.d.). Support Vector Machines. [Online]. Available: https://scikit-learn.org/stable/modules/svm.html
[28] Scikit-learn. (n.d.). 3.2.4.3.1. sklearn.ensemble.RandomForestClassifier. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[29] Xgboost developers. (n.d.). XGBoost Documentation. [Online]. Available: https://xgboost.readthedocs.io/en/latest/
[30] Xgboost developers. (n.d.). XGBoost Parameters. [Online]. Available: https://xgboost.readthedocs.io/en/latest/parameter.html
Use of external Open Source packages:
https://github.com/Phlya/adjustText (The MIT License)
https://github.com/nvictus/svgpath2mpl (The 3-Clause BSD License)
We aimed for visually friendly designs and used color schemes that should be easily distinguishable by people with all types of color vision. We ensured that default fonts are no smaller than 9 points/pixels in all of our plots. Feedback and suggestions on how we can continue to improve accessibility are welcome.